DEGREES OF FREEDOM IN LASSO PROBLEMS

By Ryan J. Tibshirani and Jonathan Taylor

Carnegie Mellon University and Stanford University 

We derive the degrees of freedom of the lasso fit, placing no as- 
sumptions on the predictor matrix X. Like the well-known result of 
Zou, Hastie and Tibshirani [Ann. Statist. 35 (2007) 2173-2192], which 
gives the degrees of freedom of the lasso fit when X has full column 
rank, we express our result in terms of the active set of a lasso so- 
lution. We extend this result to cover the degrees of freedom of the 
generalized lasso fit for an arbitrary predictor matrix X (and an ar- 
bitrary penalty matrix D). Though our focus is degrees of freedom, 
we establish some intermediate results on the lasso and generalized 
lasso that may be interesting on their own. 

1. Introduction. We study degrees of freedom, or the "effective number
of parameters," in $\ell_1$-penalized linear regression problems. In particular, for
a response vector $y \in \mathbb{R}^n$, predictor matrix $X \in \mathbb{R}^{n \times p}$ and tuning parameter
$\lambda \geq 0$, we consider the lasso problem [Chen, Donoho and Saunders (1998),
Tibshirani (1996)]

(1) $\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$.



The above notation emphasizes the fact that the solution $\hat\beta$ may not be
unique [such nonuniqueness can occur if rank(X) < p]. Throughout the paper,
when a function $f : D \to \mathbb{R}$ may have a nonunique minimizer over its
domain D, we write $\arg\min_{x \in D} f(x)$ to denote the set of minimizing x values,
that is, $\arg\min_{x \in D} f(x) = \{\hat{x} \in D : f(\hat{x}) = \min_{x \in D} f(x)\}$.
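To make the nonuniqueness concrete, the following minimal sketch evaluates the criterion in (1) on a toy problem with two identical predictors, so that rank(X) = 1 < p = 2. The data and the helper name `lasso_objective` are illustrative assumptions and do not come from the paper. Only the sum of the two coefficients affects the fit, and the criterion is minimized when that sum equals 2, so each candidate below attains the minimum value 3.

```python
import numpy as np

def lasso_objective(beta, X, y, lam):
    """Lasso criterion (1): 0.5 * ||y - X beta||_2^2 + lam * ||beta||_1."""
    resid = y - X @ beta
    return 0.5 * resid @ resid + lam * np.abs(beta).sum()

# Hypothetical toy data: two identical predictors, so rank(X) = 1 < p = 2.
X = np.array([[1.0, 1.0],
              [0.0, 0.0]])
y = np.array([3.0, 1.0])
lam = 1.0

# Three different coefficient vectors with the same coefficient sum.
candidates = [np.array([2.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]
values = [lasso_objective(b, X, y, lam) for b in candidates]
fits = [X @ b for b in candidates]
```

The three solutions differ, and so do their supports, but the fitted vectors in `fits` coincide.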

A fundamental result on the degrees of freedom of the lasso fit was shown
by Zou, Hastie and Tibshirani (2007). The authors show that if y follows
a normal distribution with spherical covariance, $y \sim N(\mu, \sigma^2 I)$, and $X, \lambda$ are
considered fixed with rank(X) = p, then

(2) $\mathrm{df}(X\hat\beta) = \mathrm{E}|\mathcal{A}|$,

where $\mathcal{A} = \mathcal{A}(y)$ denotes the active set of the unique lasso solution at y,
and $|\mathcal{A}|$ is its cardinality. This is quite a well-known result, and is sometimes
used to informally justify an application of the lasso procedure, as
it says that the number of parameters used by the lasso fit is simply equal to
the (average) number of selected variables. However, we note that the assumption
rank(X) = p implies that $p \leq n$; in other words, the degrees of
freedom result (2) does not cover the important "high-dimensional" case
p > n. In this case, the lasso solution is not necessarily unique, which raises
the questions: 

• Can we still express degrees of freedom in terms of the active set of a lasso 
solution? 

• If so, which active set (solution) would we refer to? 

In Section 3, we provide answers to these questions, by proving a stronger
result when X is a general predictor matrix. We show that the subspace
spanned by the columns of X in $\mathcal{A}$ is almost surely unique, where "almost
surely" means for almost every $y \in \mathbb{R}^n$. Furthermore, the degrees of freedom
of the lasso fit is simply the expected dimension of this column space.
We also consider the generalized lasso problem, 

(3) $\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|D\beta\|_1$,

where $D \in \mathbb{R}^{m \times p}$ is a penalty matrix, and again the notation emphasizes the
fact that $\hat\beta$ need not be unique [when rank(X) < p]. This of course reduces to
the usual lasso problem (1) when D = I, and Tibshirani and Taylor (2011)
demonstrate that the formulation (3) encapsulates several other important
problems (including the fused lasso on any graph and trend filtering of any
order) by varying the penalty matrix D. The same paper shows that if y is
normally distributed as above, and $X, D, \lambda$ are fixed with rank(X) = p, then
the generalized lasso fit has degrees of freedom

(4) $\mathrm{df}(X\hat\beta) = \mathrm{E}[\mathrm{nullity}(D_{-\mathcal{B}})]$.

Here $\mathcal{B} = \mathcal{B}(y)$ denotes the boundary set of an optimal subgradient to the
generalized lasso problem at y (equivalently, the boundary set of a dual
solution at y), $D_{-\mathcal{B}}$ denotes the matrix D after having removed the rows
that are indexed by $\mathcal{B}$, and $\mathrm{nullity}(D_{-\mathcal{B}}) = \dim(\mathrm{null}(D_{-\mathcal{B}}))$, the dimension
of the null space of $D_{-\mathcal{B}}$.

It turns out that examining (4) for specific choices of D produces a number 
of interpretable corollaries, as discussed in Tibshirani and Taylor (2011). For 
example, this result implies that the degrees of freedom of the fused lasso 
fit is equal to the expected number of fused groups, and that the degrees of 
freedom of the trend filtering fit is equal to the expected number of knots
$+\,(k - 1)$, where k is the order of the polynomial. The result (4) assumes
that rank(X) = p and does not cover the case p > n; in Section 4, we derive
the degrees of freedom of the generalized lasso fit for a general X (and still
a general D). As in the lasso case, we prove that there exists a linear subspace
$X(\mathrm{null}(D_{-\mathcal{B}}))$ that is almost surely unique, meaning that it will be the same
under different boundary sets $\mathcal{B}$ corresponding to different solutions of (3).
The generalized lasso degrees of freedom is then the expected dimension of 
this subspace. 
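As a small illustration of the quantity inside the expectation in (4), the sketch below builds the first-difference penalty matrix of the one-dimensional fused lasso and computes the nullity of D after removing the rows in a boundary set. The helper names and the particular boundary set are hypothetical, and indices are 0-based as is usual in NumPy. Removing the rows in the boundary set leaves constraints that tie neighboring coefficients together, so the nullity counts the fused groups.

```python
import numpy as np

def first_difference_matrix(n):
    """(n-1) x n matrix D with (D beta)_i = beta_{i+1} - beta_i."""
    D = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def nullity(M):
    """Dimension of the null space: number of columns minus rank."""
    return M.shape[1] - np.linalg.matrix_rank(M)

n = 6
D = first_difference_matrix(n)
B = [1, 3]                                # hypothetical boundary set (0-based rows)
D_minus_B = np.delete(D, B, axis=0)       # D with the rows in B removed
df_term = nullity(D_minus_B)              # fused groups {0,1}, {2,3}, {4,5}
```

With an empty boundary set the nullity is 1, corresponding to a single fused group (a constant fit).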

Our assumptions throughout the paper are minimal. As was already 
mentioned, we place no assumptions whatsoever on the predictor matrix 
$X \in \mathbb{R}^{n \times p}$ or on the penalty matrix $D \in \mathbb{R}^{m \times p}$, considering them fixed and
nonrandom. We also consider $\lambda \geq 0$ fixed. For Theorems 1, 2 and 3 we assume
that y is normally distributed,

(5) $y \sim N(\mu, \sigma^2 I)$

for some (unknown) mean vector $\mu \in \mathbb{R}^n$ and marginal variance $\sigma^2 > 0$. This
assumption is only needed in order to apply Stein's formula for degrees 
of freedom, and none of the other lasso and generalized lasso results in 
the paper, namely Lemmas 3 through 10, make any assumption about the 
distribution of y. 

This paper is organized as follows. The rest of the Introduction contains 
an overview of related work, and an explanation of our notation. Section 2 
covers some relevant background material on degrees of freedom and con- 
vex polyhedra. Though the connection may not be immediately obvious, 
the geometry of polyhedra plays a large role in understanding problems (1) 
and (3), and Section 2.2 gives a high-level view of this geometry before the 
technical arguments that follow in Sections 3 and 4. In Section 3, we derive 
two representations for the degrees of freedom of the lasso fit, given in Theo- 
rems 1 and 2. In Section 4, we derive the analogous results for the generalized 
lasso problem, and these are given in Theorem 3. As the lasso problem is 
a special case of the generalized lasso problem (corresponding to D = I), 
Theorems 1 and 2 can actually be viewed as corollaries of Theorem 3. The 
reader may then ask: why is there a separate section dedicated to the lasso 
problem? We give two reasons: first, the lasso arguments are simpler and 
easier to follow than their generalized lasso counterparts; second, we cover 
some intermediate results for the lasso problem that are interesting in their 
own right and that do not carry over to the generalized lasso perspective. 
Section 5 contains some final discussion. 

1.1. Related work. All of the degrees of freedom results discussed here 
assume that the response vector has distribution $y \sim N(\mu, \sigma^2 I)$, and that
the predictor matrix X is fixed. To the best of our knowledge, Efron et al.
(2004) were the first to prove a result on the degrees of freedom of the lasso
fit, using the lasso solution path with $\lambda$ moving from $\infty$ to 0. The authors
showed that when the active set reaches size k along this path, the lasso
fit has degrees of freedom exactly k. This result assumes that X has full
column rank and further satisfies a restrictive condition called the "positive
cone condition," which ensures that as $\lambda$ decreases, variables can only enter,
and not leave, the active set. Subsequent results on the lasso degrees of
freedom (including those presented in this paper) differ from this original
result in that they derive degrees of freedom for a fixed value of the tuning
parameter $\lambda$, and not a fixed number of steps k taken along the solution
path. 

As mentioned previously, Zou, Hastie and Tibshirani (2007) established 
the basic lasso degrees of freedom result (for fixed $\lambda$) stated in (2). This is
analogous to the path result of Efron et al. (2004); here degrees of freedom
is equal to the expected size of the active set (rather than simply the size)
because for a fixed $\lambda$ the active set is a random quantity, and can hence
achieve a random size. The proof of (2) appearing in Zou, Hastie and Tib- 
shirani (2007) relies heavily on properties of the lasso solution path. As also 
mentioned previously, Tibshirani and Taylor (2011) derived an extension 
of (2) to the generalized lasso problem, which is stated in (4) for an arbi- 
trary penalty matrix D. Their arguments are not based on properties of the 
solution path, but instead come from a geometric perspective much like the 
one developed in this paper. 

Both of the results (2) and (4) assume that rank(X) = p; the current 
work extends these to the case of an arbitrary matrix X, in Theorems 1, 2 
(the lasso) and 3 (the generalized lasso). In terms of our intermediate re- 
sults, a version of Lemmas 5, 6 corresponding to rank(X) =p appears in 
Zou, Hastie and Tibshirani (2007), and a version of Lemma 9 correspond- 
ing to rank(X) =p appears in Tibshirani and Taylor (2011) [furthermore, 
Tibshirani and Taylor (2011) only consider the boundary set representation 
and not the active set representation]. Lemmas 1, 2 and the conclusions 
thereafter, on the degrees of freedom of the projection map onto a convex 
polyhedron, are essentially given in Meyer and Woodroofe (2000), though 
these authors state and prove the results in a different manner. 

In preparing a draft of this manuscript, it was brought to our attention 
that other authors have independently and concurrently worked to extend 
results (2) and (4) to the general X case. Namely, Dossal et al. (2011) 
prove a result on the lasso degrees of freedom, and Vaiter et al. (2011) prove 
a result on the generalized lasso degrees of freedom, both for an arbitrary X. 
These authors' results express degrees of freedom in terms of the active sets 
of special (lasso or generalized lasso) solutions. Theorems 2 and 3 express 
degrees of freedom in terms of the active sets of any solutions, and hence the 
appropriate application of these theorems provides an alternative verification 
of these formulas. We discuss this in detail in the form of remarks following 
the theorems.

1.2. Notation. In this paper, we use col(A), row(A) and null(A) to denote
the column space, row space and null space of a matrix A, respectively;
we use rank(A) and nullity(A) to denote the dimensions of col(A)
[equivalently, row(A)] and null(A), respectively. We write $A^+$ for the
Moore-Penrose pseudoinverse of A; for a rectangular matrix A, recall that
$A^+ = (A^T A)^+ A^T$. We write $P_L$ to denote the projection matrix onto a linear
subspace L, and more generally, $P_C(x)$ to denote the projection of a point x
onto a closed convex set C. For readability, we sometimes write $\langle a, b \rangle$ (instead
of $a^T b$) to denote the inner product between vectors a and b.

For a set of indices $R = \{i_1, \ldots, i_k\} \subseteq \{1, \ldots, m\}$ satisfying $i_1 < \cdots < i_k$,
and a vector $x \in \mathbb{R}^m$, we use $x_R$ to denote the subvector $x_R = (x_{i_1}, \ldots, x_{i_k})^T \in \mathbb{R}^k$.
We denote the complementary subvector by $x_{-R} = x_{\{1,\ldots,m\} \setminus R} \in \mathbb{R}^{m-k}$.
The notation is similar for matrices. Given another subset of indices
$S = \{j_1, \ldots, j_\ell\} \subseteq \{1, \ldots, p\}$ with $j_1 < \cdots < j_\ell$, and a matrix $A \in \mathbb{R}^{m \times p}$, we use
$A_{(R,S)}$ to denote the submatrix

$A_{(R,S)} = \begin{bmatrix} A_{i_1,j_1} & \cdots & A_{i_1,j_\ell} \\ \vdots & & \vdots \\ A_{i_k,j_1} & \cdots & A_{i_k,j_\ell} \end{bmatrix} \in \mathbb{R}^{k \times \ell}$.

In words, rows are indexed by R, and columns are indexed by S. When
combining this notation with the transpose operation, we assume that the
indexing happens first, so that $A_{(R,S)}^T = (A_{(R,S)})^T$. As above, negative signs
are used to denote the complementary set of rows or columns; for example,
$A_{(-R,S)} = A_{(\{1,\ldots,m\} \setminus R,\, S)}$. To extract only rows or only columns, we
abbreviate the other dimension by a dot, so that $A_{(R,\cdot)} = A_{(R,\{1,\ldots,p\})}$ and
$A_{(\cdot,S)} = A_{(\{1,\ldots,m\},S)}$; to extract a single row or column, we use $A_{(i,\cdot)} = A_{(\{i\},\cdot)}$
or $A_{(\cdot,j)} = A_{(\cdot,\{j\})}$. Finally, and most importantly, we introduce the following
shorthand notation:

• For the predictor matrix $X \in \mathbb{R}^{n \times p}$, we let $X_S = X_{(\cdot,S)}$.

• For the penalty matrix $D \in \mathbb{R}^{m \times p}$, we let $D_R = D_{(R,\cdot)}$.

In other words, the default for X is to index its columns, and the default
for D is to index its rows. This convention greatly simplifies the notation
in expressions that involve multiple instances of $X_S$ or $D_R$; however, its use
could also cause a great deal of confusion, if not properly interpreted by the
reader!
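For readers who think in terms of array code, the indexing conventions of this section can be mirrored in NumPy as in the sketch below. The helper names are our own illustrative choices, and NumPy indices are 0-based whereas the index sets in the text start at 1.

```python
import numpy as np

def cols(X, S):
    """X_S: columns of X indexed by S (the default for the predictor matrix)."""
    return X[:, list(S)]

def rows(D, R):
    """D_R: rows of D indexed by R (the default for the penalty matrix)."""
    return D[list(R), :]

def rows_complement(D, R):
    """D_{-R}: rows of D that are not indexed by R."""
    return np.delete(D, list(R), axis=0)

def submatrix(A, R, S):
    """A_(R,S): rows indexed by R and columns indexed by S."""
    return A[np.ix_(list(R), list(S))]

A = np.arange(12.0).reshape(3, 4)         # hypothetical 3 x 4 matrix
block = submatrix(A, [0, 2], [1, 3])      # rows 0 and 2, columns 1 and 3
```

Note that `np.ix_` is needed for the general submatrix, since indexing with two plain lists would pair the row and column indices elementwise instead of forming the block.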

2. Preliminary material. The following two sections describe some back- 
ground material needed to follow the results in Sections 3 and 4. 

2.1. Degrees of freedom. If the data vector $y \in \mathbb{R}^n$ is distributed according
to the homoskedastic model $y \sim (\mu, \sigma^2 I)$, meaning that the components
of y are uncorrelated, with $y_i$ having mean $\mu_i$ and variance $\sigma^2$
for $i = 1, \ldots, n$, then the degrees of freedom of a function $g : \mathbb{R}^n \to \mathbb{R}^n$ with
$g(y) = (g_1(y), \ldots, g_n(y))^T$, is defined as

(6) $\mathrm{df}(g) = \frac{1}{\sigma^2} \sum_{i=1}^n \mathrm{Cov}(g_i(y), y_i)$.

This definition is often attributed to Efron (1986) or Hastie and Tibshirani
(1990), and is interpreted as the "effective number of parameters" used by
the fitting procedure g. Note that for the linear regression fit of $y \in \mathbb{R}^n$ onto
a fixed and full column rank predictor matrix $X \in \mathbb{R}^{n \times p}$, we have $g(y) = \hat{y} = XX^+ y$,
and $\mathrm{df}(\hat{y}) = \mathrm{tr}(XX^+) = p$, which is the number of fitted coefficients
(one for each predictor variable). Furthermore, we can decompose the risk
of $\hat{y}$, denoted by $\mathrm{Risk}(\hat{y}) = \mathrm{E}\|\hat{y} - \mu\|_2^2$, as

$\mathrm{Risk}(\hat{y}) = \mathrm{E}\|\hat{y} - y\|_2^2 - n\sigma^2 + 2p\sigma^2$,

a well-known identity that leads to the derivation of the $C_p$ statistic [Mallows
(1973)]. For a general fitting procedure g, the motivation for the definition
(6) comes from the analogous decomposition of the quantity $\mathrm{Risk}(g) = \mathrm{E}\|g(y) - \mu\|_2^2$,

(7) $\mathrm{Risk}(g) = \mathrm{E}\|g(y) - y\|_2^2 - n\sigma^2 + 2\sum_{i=1}^n \mathrm{Cov}(g_i(y), y_i)$.

Therefore a large difference between risk and expected training error implies 
a large degrees of freedom. 
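The linear regression case can be checked numerically. The sketch below computes $\mathrm{tr}(XX^+)$ for a random full column rank design and also approximates the covariance sum in definition (6) by simulation. The design, mean vector, noise level and number of replications are arbitrary choices made for illustration, so the Monte Carlo value is only expected to be close to p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
X = rng.standard_normal((n, p))           # full column rank with probability one
H = X @ np.linalg.pinv(X)                 # projection matrix X X^+
df_linear = np.trace(H)                   # equals p up to rounding

# Monte Carlo approximation of definition (6) for g(y) = H y.
mu = X @ np.array([1.0, -2.0, 0.5, 0.0])  # hypothetical mean vector
sigma = 1.5
reps = 20000
Y = mu + sigma * rng.standard_normal((reps, n))   # each row is one draw of y
G = Y @ H.T                                        # fitted values for each draw
cov_sum = np.mean(np.sum((G - G.mean(axis=0)) * (Y - mu), axis=1))
df_mc = cov_sum / sigma**2
```

The trace is exact, while `df_mc` fluctuates around p with a standard error that shrinks as the number of replications grows.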

Why is the concept of degrees of freedom important? One simple answer 
is that it provides a way to put different fitting procedures on equal footing. 
For example, it would not seem fair to compare a procedure that uses an 
effective number of parameters equal to 100 with another that uses only 10. 
However, assuming that these procedures can be tuned to varying levels of 
adaptivity (as is the case with the lasso and generalized lasso, where the 
adaptivity is controlled by A), one could first tune the procedures to have 
the same degrees of freedom, and then compare their performances. Doing 
this over several common values for degrees of freedom may reveal, in an 
informal sense, that one procedure is particularly efficient when it comes to 
its parameter usage versus another. 

A more detailed answer to the above question is based on the risk decomposition
(7). The decomposition suggests that an estimate $\widehat{\mathrm{df}}(g)$ of degrees of
freedom can be used to form an estimate of the risk,

(8) $\widehat{\mathrm{Risk}}(g) = \|g(y) - y\|_2^2 - n\sigma^2 + 2\sigma^2\,\widehat{\mathrm{df}}(g)$.

Furthermore, it is straightforward to check that an unbiased estimate of
degrees of freedom leads to an unbiased estimate of risk; that is, $\mathrm{df}(g) =
\mathrm{E}[\widehat{\mathrm{df}}(g)]$ implies $\mathrm{Risk}(g) = \mathrm{E}[\widehat{\mathrm{Risk}}(g)]$. Hence, the risk estimate (8) can be
used to choose between fitting procedures, assuming that unbiased estimates
of degrees of freedom are available. [It is worth mentioning that bootstrap
or Monte Carlo methods can be helpful in estimating degrees of freedom (6)
when an analytic form is difficult to obtain.] The natural extension of this
idea is to use the risk estimate (8) for tuning parameter selection. If we
suppose that g depends on a tuning parameter $\lambda \in \Lambda$, denoted $g = g_\lambda(y)$,
then in principle one could minimize the estimated risk over $\lambda$ to select an
appropriate value for the tuning parameter,

(9) $\hat\lambda = \arg\min_{\lambda \in \Lambda} \|g_\lambda(y) - y\|_2^2 - n\sigma^2 + 2\sigma^2\,\widehat{\mathrm{df}}(g_\lambda)$.

This is a computationally efficient alternative to selecting the tuning parameter
by cross-validation, and it is commonly used (along with similar
methods that replace the factor of 2 above with a function of n or p) in penalized
regression problems. Even though such an estimate (9) is commonly
used in the high-dimensional setting (p > n), its asymptotic properties, such as
risk consistency or relative efficiency compared to the cross-validation estimate,
are largely unknown in this case.
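The selection rule (9) can be sketched in the simplest lasso setting, X = I, where the solution of (1) is componentwise soft-thresholding and, by (2), the number of nonzero coefficients is an unbiased estimate of degrees of freedom. The simulated data, the grid of tuning parameter values and the assumption that $\sigma$ is known are all illustrative choices, not part of the paper.

```python
import numpy as np

def soft_threshold(y, lam):
    """Lasso solution when X = I: componentwise soft-thresholding at level lam."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def risk_estimate(y, lam, sigma):
    """Estimate (8), using the active set size as the degrees of freedom estimate."""
    fit = soft_threshold(y, lam)
    df_hat = np.count_nonzero(fit)
    return np.sum((fit - y) ** 2) - y.size * sigma**2 + 2 * sigma**2 * df_hat

rng = np.random.default_rng(1)
n, sigma = 200, 1.0                                          # sigma assumed known
mu = np.concatenate([np.full(20, 4.0), np.zeros(n - 20)])    # hypothetical sparse mean
y = mu + sigma * rng.standard_normal(n)

grid = np.linspace(0.0, 4.0, 41)                             # candidate values of lambda
risks = np.array([risk_estimate(y, lam, sigma) for lam in grid])
lam_hat = grid[np.argmin(risks)]                             # minimizer in (9) over the grid
```

At $\lambda = 0$ the fit reproduces y, the degrees of freedom estimate is n, and the risk estimate reduces to $n\sigma^2$, which gives a simple sanity check on the formula.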

Stein (1981) proposed the risk estimate (8) using a particular unbiased 
estimate of degrees of freedom, now commonly referred to as Stein's unbiased 
risk estimate (SURE). Stein's framework requires that we strengthen our 
distributional assumption on y and assume normality, as stated in (5). We 
also assume that the function g is continuous and almost differentiable. 
(The precise definition of almost differentiability is not important here, but 
the interested reader may take it to mean that each coordinate function gi 
is absolutely continuous on almost every line segment parallel to one of 
the coordinate axes.) Given these assumptions, Stein's main result is an
alternate expression for degrees of freedom,

(10) $\mathrm{df}(g) = \mathrm{E}[(\nabla \cdot g)(y)]$,

where the function $\nabla \cdot g = \sum_{i=1}^n \partial g_i / \partial y_i$ is called the divergence of g. Immediately
following is the unbiased estimate of degrees of freedom,

(11) $\widehat{\mathrm{df}}(g) = (\nabla \cdot g)(y)$.

We pause for a moment to reflect on the importance of this result. From its
definition (6), we can see that the two most obvious candidates for unbiased
estimates of degrees of freedom are

$\frac{1}{\sigma^2}\sum_{i=1}^n g_i(y)(y_i - \mu_i)$ and $\frac{1}{\sigma^2}\sum_{i=1}^n (g_i(y) - \mathrm{E}[g_i(y)])\,y_i$.

To use the first estimate above, we need to know $\mu$ (remember, this is ultimately
what we are trying to estimate!). Using the second requires knowing
$\mathrm{E}[g(y)]$, which is equally impractical because this invariably depends on $\mu$.
On the other hand, Stein's unbiased estimate (11) does not have an explicit
dependence on $\mu$; moreover, it can be analytically computed for many fitting
procedures g. For example, Theorem 2 in Section 3 shows that, except for y
in a set of measure zero, the divergence of the lasso fit is equal to $\mathrm{rank}(X_\mathcal{A})$
with $\mathcal{A} = \mathcal{A}(y)$ being the active set of a lasso solution at y. Hence, Stein's
formula allows for the unbiased estimation of degrees of freedom (and subsequently,
risk) for a broad class of fitting procedures g, something that may
not have seemed possible when working from the definition directly.

2.2. Projections onto polyhedra. A set $C \subseteq \mathbb{R}^n$ is called a convex polyhedron,
or simply a polyhedron, if C is the intersection of finitely many
half-spaces,

(12) $C = \bigcap_{i=1}^k \{x \in \mathbb{R}^n : a_i^T x \leq b_i\}$,

where $a_1, \ldots, a_k \in \mathbb{R}^n$ and $b_1, \ldots, b_k \in \mathbb{R}$. (Note that we do not require boundedness
here; a bounded polyhedron is sometimes called a polytope.) See
Figure 1 for an example. There is a rich theory on polyhedra; the definitive
reference is Grünbaum (2003), and another good reference is Schneider
(1993). As this is a paper on statistics and not geometry, we do not attempt 
to give an extensive treatment of the properties of polyhedra. We do, how- 
ever, give two properties (in the form of two lemmas) that are especially 
important with respect to our statistical problem; our discussion will also 
make it clear why polyhedra are relevant in the first place. 

From its definition (12), it follows that a polyhedron is a closed convex
set. The first property that we discuss does not actually rely on the special
structure of polyhedra, but only on convexity. For any closed convex set
$C \subseteq \mathbb{R}^n$ and any point $x \in \mathbb{R}^n$, there is a unique point $u \in C$ minimizing
$\|x - u\|_2$. To see this, note that if $v \in C$ is another minimizer, $v \neq u$, then
by convexity $w = (u + v)/2 \in C$, and $\|x - w\|_2 < \|x - u\|_2/2 + \|x - v\|_2/2 =
\|x - u\|_2$, a contradiction. Therefore, the projection map onto C is indeed
well defined, and we write this as $P_C : \mathbb{R}^n \to C$,

$P_C(x) = \arg\min_{u \in C} \|x - u\|_2$.

For the usual linear regression problem, where $y \in \mathbb{R}^n$ is regressed onto
$X \in \mathbb{R}^{n \times p}$, the fit $X\hat\beta$ can be written in terms of the projection map onto the
polyhedron C = col(X), as in $X\hat\beta(y) = XX^+ y = P_{\mathrm{col}(X)}(y)$. Furthermore,
for both the lasso and generalized lasso problems, (1) and (3), it turns out
that we can express the fit as the residual from projecting onto a suitable
polyhedron $C \subseteq \mathbb{R}^n$, that is,

$X\hat\beta(y) = (I - P_C)(y) = y - P_C(y)$.

This is proved in Lemma 3 for the lasso and in Lemma 8 for the generalized
lasso (the polyhedron C depends on $X, \lambda$ for the lasso case, and on $X, D, \lambda$
for the generalized lasso case). Our first lemma establishes that both the
projection map onto a closed convex set and the residual map are nonexpansive,
hence continuous and almost differentiable everywhere. These are
the conditions needed to apply Stein's formula.

Lemma 1. For any closed convex set $C \subseteq \mathbb{R}^n$, both the projection map
$P_C : \mathbb{R}^n \to C$ and the residual projection map $I - P_C : \mathbb{R}^n \to \mathbb{R}^n$ are nonexpansive.
That is, they satisfy

$\|P_C(x) - P_C(y)\|_2 \leq \|x - y\|_2$ for any $x, y \in \mathbb{R}^n$, and

$\|(I - P_C)(x) - (I - P_C)(y)\|_2 \leq \|x - y\|_2$ for any $x, y \in \mathbb{R}^n$.

Therefore, $P_C$ and $I - P_C$ are both continuous and almost differentiable.

The proof can be found in Appendix A.1. Lemma 1 will be quite useful
later in the paper, as it will allow us to use Stein's formula to compute the 
degrees of freedom of the lasso and generalized lasso fits, after showing that 
these fits are indeed the residuals from projecting onto closed convex sets. 
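Since Lemma 1 relies only on convexity, it can be spot-checked on a closed convex set that is not a polyhedron. The sketch below uses the Euclidean ball, whose projection has a simple closed form, and records the largest violation of the two inequalities over random pairs of points. The choice of set, dimension and sample size is arbitrary and only meant as an illustration, not a proof.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto the closed convex set {u : ||u||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

rng = np.random.default_rng(2)
worst_proj, worst_resid = -np.inf, -np.inf
for _ in range(1000):
    x, y = 3.0 * rng.standard_normal(5), 3.0 * rng.standard_normal(5)
    px, py = project_l2_ball(x), project_l2_ball(y)
    dist = np.linalg.norm(x - y)
    # Both differences should be <= 0 if the maps are nonexpansive.
    worst_proj = max(worst_proj, np.linalg.norm(px - py) - dist)
    worst_resid = max(worst_resid, np.linalg.norm((x - px) - (y - py)) - dist)
```

Both `worst_proj` and `worst_resid` stay nonpositive up to floating point rounding, in agreement with the two inequalities of the lemma.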

The second property that we discuss uses the structure of polyhedra. 
Unlike Lemma 1, this property will not be used directly in the following 
sections of the paper; instead, we present it here to give some intuition with 
respect to the degrees of freedom calculations to come. The property can be 
best explained by looking back at Figure 1. Loosely speaking, the picture 
suggests that we can move the point x around a bit and it will still project to 
the same face of C. Another way of saying this is that there is a neighborhood
of x on which $P_C$ is simply the projection onto an affine subspace. This would
not be true if x is in some exceptional set, which is made up of rays that
emanate from the corners of C, like the two drawn in the bottom right corner
of the figure. However, the union of such rays has measure zero, so the map $P_C$
is locally an affine projection, almost everywhere. This idea can be stated
formally as follows. 

Lemma 2. Let $C \subseteq \mathbb{R}^n$ be a polyhedron. For almost every $x \in \mathbb{R}^n$, there is
an associated neighborhood U of x, linear subspace $L \subseteq \mathbb{R}^n$ and point $a \in \mathbb{R}^n$,
such that the projection map restricted to U, $P_C : U \to C$, is

$P_C(y) = P_L(y - a) + a$ for $y \in U$,

which is simply the projection onto the affine subspace L + a.

The proof is given in Appendix A.2. These last two properties can be
used to derive a general expression for the degrees of freedom of the fitting
procedure $g(y) = (I - P_C)(y)$, when $C \subseteq \mathbb{R}^n$ is a polyhedron. [A similar
formula holds for $g(y) = P_C(y)$.] Lemma 1 tells us that $I - P_C$ is continuous
and almost differentiable, so we can use Stein's formula (10) to compute its
degrees of freedom. Lemma 2 tells us that for almost every $y \in \mathbb{R}^n$, there is
a neighborhood U of y, linear subspace $L \subseteq \mathbb{R}^n$, and point $a \in \mathbb{R}^n$, such that

$(I - P_C)(y') = y' - P_L(y' - a) - a = (I - P_L)(y' - a)$ for $y' \in U$.

Therefore,

$(\nabla \cdot (I - P_C))(y) = \mathrm{tr}(I - P_L) = n - \dim(L)$,

and an expectation over y gives

$\mathrm{df}(I - P_C) = n - \mathrm{E}[\dim(L)]$.

It should be made clear that the random quantity in the above expectation
is the linear subspace L = L(y), which depends on y.
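The local affine representation can be written out explicitly for a simple polyhedron. In the sketch below, C is the box $\{u : \|u\|_\infty \leq \lambda\}$, the subspace L is spanned by the coordinate directions in which y lies strictly inside the box, and a records the clipped coordinates. The point y and the size of the neighborhood are hypothetical choices that keep every coordinate away from $\pm\lambda$.

```python
import numpy as np

lam = 1.0
y = np.array([2.5, 0.3, -1.7, -0.4])       # hypothetical point, no coordinate at +-lam
n = y.size

def project_box(v):
    """Projection onto the polyhedron C = {u : ||u||_inf <= lam}."""
    return np.clip(v, -lam, lam)

free = np.abs(y) < lam                      # coordinates that move with y
P_L = np.diag(free.astype(float))           # projection onto L = span{e_i : i free}
a = np.where(free, 0.0, lam * np.sign(y))   # offset fixing the clipped coordinates

rng = np.random.default_rng(3)
max_err = 0.0
for _ in range(100):
    y_new = y + 0.1 * rng.uniform(-1, 1, size=n)    # stays in a neighborhood of y
    affine = P_L @ (y_new - a) + a                  # P_L(y' - a) + a
    max_err = max(max_err, np.max(np.abs(project_box(y_new) - affine)))

divergence_residual = np.trace(np.eye(n) - P_L)     # n - dim(L)
```

Here dim(L) = 2, so the divergence of the residual map at y equals 2, which is the number of coordinates of y lying outside the box.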

In a sense, the remainder of this paper is focused on describing dim(L) — 
the dimension of the face of C onto which the point y projects — in a mean- 
ingful way for the lasso and generalized lasso problems. Section 3 considers 
the lasso problem, and we show that L can be written in terms of the equicor- 
relation set of the fit at y. We also show that L can be described in terms of 
the active set of a solution at y. In Section 4 we show the analogous results 
for the generalized lasso problem, namely, that L can be written in terms of 
either the boundary set of an optimal subgradient at y (the analogy of the 
equicorrelation set for the lasso) or the active set of a solution at y. 

3. The lasso. In this section we derive the degrees of freedom of the lasso 
fit, for a general predictor matrix X. All of our arguments stem from the 
Karush-Kuhn-Tucker (KKT) optimality conditions, and we present these 
first. We note that many of the results in this section can be alternatively
derived using the lasso dual problem. Appendix A.5 explains this connection
more precisely. For the current work, we avoid the dual perspective
simply to keep the presentation more self-contained. Finally, we remind the
reader that $X_S$ is used to extract columns of X corresponding to an index
set S.

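3.1. The KKT conditions and the underlying polyhedron. The KKT conditions
for the lasso problem (1) can be expressed as

(13) $X^T(y - X\hat\beta) = \lambda\gamma$,

(14) $\gamma_i \in \{\mathrm{sign}(\hat\beta_i)\}$ if $\hat\beta_i \neq 0$, and $\gamma_i \in [-1, 1]$ if $\hat\beta_i = 0$.

Here $\gamma \in \mathbb{R}^p$ is a subgradient of the function $f(x) = \|x\|_1$ evaluated at $x = \hat\beta$.
Hence $\hat\beta$ is a minimizer in (1) if and only if $\hat\beta$ satisfies (13) and (14) for
some $\gamma$. Directly from the KKT conditions, we can show that $X\hat\beta$ is the
residual from projecting y onto a polyhedron.

Lemma 3. For any X and $\lambda \geq 0$, the lasso fit $X\hat\beta$ can be written as
$X\hat\beta(y) = (I - P_C)(y)$, where $C \subseteq \mathbb{R}^n$ is the polyhedron

$C = \{u \in \mathbb{R}^n : \|X^T u\|_\infty \leq \lambda\}$.

Proof. Given a point $y \in \mathbb{R}^n$, its projection $\theta = P_C(y)$ onto a closed
convex set $C \subseteq \mathbb{R}^n$ can be characterized as the unique point satisfying

(15) $\langle y - \theta, \theta - u \rangle \geq 0$ for all $u \in C$.

Hence defining $\theta = y - X\hat\beta(y)$, and C as in the lemma, we want to show
that (15) holds for all $u \in C$. Well,

(16) $\langle y - \theta, \theta - u \rangle = \langle X\hat\beta, y - X\hat\beta - u \rangle = \langle X\hat\beta, y - X\hat\beta \rangle - \langle X^T u, \hat\beta \rangle$.

Consider the first term above. Taking an inner product with $\hat\beta$ on both
sides of (13) gives $\langle X\hat\beta, y - X\hat\beta \rangle = \lambda\|\hat\beta\|_1$. Furthermore, the $\ell_1$ norm can be
characterized in terms of its dual norm, the $\ell_\infty$ norm, as in

$\lambda\|\hat\beta\|_1 = \max_{\|w\|_\infty \leq \lambda} \langle w, \hat\beta \rangle$.

Therefore, continuing from (16), we have

$\langle y - \theta, \theta - u \rangle = \max_{\|w\|_\infty \leq \lambda} \langle w, \hat\beta \rangle - \langle X^T u, \hat\beta \rangle$,

which is $\geq 0$ for all $u \in C$, and we have hence proved that $\theta = y - X\hat\beta(y) =
P_C(y)$. To show that C is indeed a polyhedron, note that it can be written as

$C = \bigcap_{i=1}^p \left(\{u \in \mathbb{R}^n : X_i^T u \leq \lambda\} \cap \{u \in \mathbb{R}^n : X_i^T u \geq -\lambda\}\right)$,

which is a finite intersection of half-spaces. □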

Showing that the lasso fit is the residual from projecting y onto a polyhedron
is important, because it means that $X\hat\beta(y)$ is nonexpansive as a function
of y, and hence continuous and almost differentiable, by Lemma 1.
This establishes the conditions that are needed to apply Stein's formula for 
degrees of freedom. 
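Lemma 3 can be spot-checked in a setting where the lasso solution is available in closed form. The sketch below assumes a design with orthonormal columns (obtained here from a QR factorization of a random matrix, an illustrative choice), for which the solution of (1) is soft-thresholding of $X^T y$. It then verifies that $\theta = y - X\hat\beta$ lies in C and satisfies the characterization (15) against randomly drawn points of C.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 8, 3, 0.7
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))        # X = Q has orthonormal columns
y = 2.0 * rng.standard_normal(n)

z = Q.T @ y
beta = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)    # lasso solution when X^T X = I
theta = y - Q @ beta                                    # candidate for P_C(y)

# theta should belong to C = {u : ||X^T u||_inf <= lam}.
in_C = bool(np.max(np.abs(Q.T @ theta)) <= lam + 1e-12)

# Check inequality (15) against random points u of C.
min_inner = np.inf
for _ in range(1000):
    c = rng.uniform(-lam, lam, size=p)                  # X^T u, inside the box
    v = rng.standard_normal(n)
    u = Q @ c + (v - Q @ (Q.T @ v))                     # plus a part orthogonal to col(X)
    min_inner = min(min_inner, (y - theta) @ (theta - u))
```

The smallest inner product found is nonnegative up to rounding, as (15) requires, so $\theta$ behaves like the projection of y onto C in this example.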

In the next section, we define the equicorrelation set $\mathcal{E}$, and show that the
lasso fit and solutions both have an explicit form in terms of $\mathcal{E}$. Following
this, we derive an expression for the lasso degrees of freedom as a function
of the equicorrelation set.

3.2. The equicorrelation set. According to Lemma 3, the lasso fit $X\hat\beta$
is always unique (because projection onto a closed convex set is unique).
Therefore, even though the solution $\hat\beta$ is not necessarily unique, the optimal
subgradient $\gamma$ is unique, because it can be written entirely in terms of $X\hat\beta$,
as shown by (13). We define the unique equicorrelation set $\mathcal{E}$ as

(17) $\mathcal{E} = \{i \in \{1, \ldots, p\} : |\gamma_i| = 1\}$.

An alternative definition for the equicorrelation set is

(18) $\mathcal{E} = \{i \in \{1, \ldots, p\} : |X_i^T(y - X\hat\beta)| = \lambda\}$,

which explains its name, as $\mathcal{E}$ can be thought of as the set of variables
that have equal and maximal absolute inner product (or correlation for
standardized variables) with the residual.

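The set $\mathcal{E}$ is a natural quantity to work with because we can express the
lasso fit and the set of lasso solutions in terms of $\mathcal{E}$, by working directly from
equation (13). First we let

(19) $s = \mathrm{sign}(\gamma_\mathcal{E}) = \mathrm{sign}(X_\mathcal{E}^T(y - X\hat\beta))$,

the signs of the inner products of the equicorrelation variables with the
residual. Since $\hat\beta_{-\mathcal{E}} = 0$ by definition of the subgradient, the $\mathcal{E}$ block of the
KKT conditions can be rewritten as

(20) $X_\mathcal{E}^T(y - X_\mathcal{E}\hat\beta_\mathcal{E}) = \lambda s$.

Because $\lambda s \in \mathrm{row}(X_\mathcal{E})$, we can write $\lambda s = X_\mathcal{E}^T(X_\mathcal{E}^T)^+ \lambda s$, so rearranging (20)
we get

$X_\mathcal{E}^T X_\mathcal{E}\hat\beta_\mathcal{E} = X_\mathcal{E}^T(y - (X_\mathcal{E}^T)^+ \lambda s)$.

Therefore, the lasso fit $X\hat\beta = X_\mathcal{E}\hat\beta_\mathcal{E}$ is

(21) $X\hat\beta = X_\mathcal{E}(X_\mathcal{E})^+(y - (X_\mathcal{E}^T)^+ \lambda s)$,

and any lasso solution must be of the form

(22) $\hat\beta_{-\mathcal{E}} = 0$ and $\hat\beta_\mathcal{E} = (X_\mathcal{E})^+(y - (X_\mathcal{E}^T)^+ \lambda s) + b$,

where $b \in \mathrm{null}(X_\mathcal{E})$. In the case that $\mathrm{null}(X_\mathcal{E}) = \{0\}$ (for example, this holds
if rank(X) = p), the lasso solution is unique and is given by (22) with b = 0.
But in general, when $\mathrm{null}(X_\mathcal{E}) \neq \{0\}$, it is important to note that not every
$b \in \mathrm{null}(X_\mathcal{E})$ necessarily leads to a lasso solution in (22); the vector b must
also preserve the signs of the nonzero coefficients; that is, it must also satisfy

(23) $\mathrm{sign}([(X_\mathcal{E})^+(y - (X_\mathcal{E}^T)^+ \lambda s)]_i + b_i) = s_i$
for each i such that $[(X_\mathcal{E})^+(y - (X_\mathcal{E}^T)^+ \lambda s)]_i + b_i \neq 0$.

Otherwise, $\gamma$ would not be a proper subgradient of $\|\hat\beta\|_1$.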
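The formulas (18) through (23) can be traced on a toy problem small enough to verify by hand. The sketch below assumes a design with one duplicated column and takes the lasso fit as known; for this design the fit is (2, 0), obtained by soft-thresholding the sum of the two coefficients. All names and data are illustrative assumptions. Indices are 0-based, and a small tolerance decides which coefficients count as nonzero.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [0.0, 0.0]])                  # hypothetical design, duplicated column
y = np.array([3.0, 1.0])
lam = 1.0

fit = np.array([2.0, 0.0])                  # lasso fit for this toy problem
corr = X.T @ (y - fit)
E = np.flatnonzero(np.isclose(np.abs(corr), lam))    # equicorrelation set (18)
s = np.sign(corr[E])                                 # signs (19)

X_E = X[:, E]
shifted = y - np.linalg.pinv(X_E.T) @ (lam * s)      # y - (X_E^T)^+ lam s
fit_21 = X_E @ np.linalg.pinv(X_E) @ shifted         # fit from formula (21)
beta_0 = np.linalg.pinv(X_E) @ shifted               # solution (22) with b = 0

def is_solution(b, tol=1e-10):
    """Check b in null(X_E) and the sign condition (23) for beta_E = beta_0 + b."""
    beta_E = beta_0 + b
    nonzero = np.abs(beta_E) > tol
    in_null = np.allclose(X_E @ b, 0.0)
    return bool(in_null and np.all(np.sign(beta_E[nonzero]) == s[nonzero]))
```

In this example `beta_0` is (1, 1), and `is_solution` accepts b = (1, -1), which gives the solution (2, 0), but rejects b = (2, -2), which would give (3, -1) and violate the sign condition.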

3.3. Degrees of freedom in terms of the equicorrelation set. Using rel- 
atively simple arguments, we can derive a result on the lasso degrees of 
freedom in terms of the equicorrelation set. Our arguments build on the 
following key lemma. 

Lemma 4. For any $y, X$ and $\lambda \geq 0$, a lasso solution is given by

(24) $\hat\beta_{-\mathcal{E}} = 0$ and $\hat\beta_\mathcal{E} = (X_\mathcal{E})^+(y - (X_\mathcal{E}^T)^+ \lambda s)$,

where $\mathcal{E}$ and s are the equicorrelation set and signs, as defined in (17)
and (19).

In other words, Lemma 4 says that the sign condition (23) is always
satisfied by taking b = 0, regardless of the rank of X. This result is inspired
by the LARS work of Efron et al. (2004), though it is not proved in the
LARS paper; see Appendix B of Tibshirani (2011) for a proof.

Next we show that, almost everywhere in y, the equicorrelation set and
signs are locally constant functions of y. To emphasize their functional dependence
on y, we write them as $\mathcal{E}(y)$ and $s(y)$.

Lemma 5. For almost every $y \in \mathbb{R}^n$, there exists a neighborhood U of y
such that $\mathcal{E}(y') = \mathcal{E}(y)$ and $s(y') = s(y)$ for all $y' \in U$.

Proof. Define 

= U U{z G : [(X,)+](,,)(z - {XjyXs) = 0}, 
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where the first union above is taken over all subsets £ {1, . . . ,p} and sign 
vectors s € {— l,!}'^', but we exclude sets £ for which a row of {X^)~^ is 
entirely zero. The set M is a finite union of affine subspaces of dimension 
n — 1, and therefore has measure zero. 

Let y ^ M, and abbreviate the equicorrelation set and signs as S = £{y) 
and s = s{y). We may assume no row of {X£)~^ is entirely zero. (Otherwise, 
this implies that has a zero column, which implies that A = 0, a trivial 
case for this lemma.) Therefore, as y ^ M, this means that the lasso solution 
given in (24) satisfies f3i{y) / for every i £ £. 

Now, for a new point y' , consider defining 

^-£{y')=0 and ^eiv') = {Xe)'' {y' - ixj)+ Xs). 

We need to verify that this is indeed a solution at y' , and that the corre- 
sponding fit has equicorrelation set £ and signs s. First notice that, after 
a straightforward calculation, 

Xj{y' - xKy')) = Xjiy' - Xe{X£)+{y' - (Xj)+As)) = Xs. 

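Also, by the continuity of the function $f : \mathbb{R}^n \to \mathbb{R}^{p - |\mathcal{E}|}$,

$f(x) = X_{-\mathcal{E}}^T \big(x - X_{\mathcal{E}} (X_{\mathcal{E}})^+ (x - (X_{\mathcal{E}}^T)^+ \lambda s)\big)$,

there exists a neighborhood $U_1$ of $y$ such that

$\| X_{-\mathcal{E}}^T (y' - X\hat\beta(y')) \|_\infty = \| X_{-\mathcal{E}}^T \big(y' - X_{\mathcal{E}} (X_{\mathcal{E}})^+ (y' - (X_{\mathcal{E}}^T)^+ \lambda s)\big) \|_\infty < \lambda$

for all $y' \in U_1$. Hence $X\hat\beta(y')$ has equicorrelation set $\mathcal{E}(y') = \mathcal{E}$ and signs
$s(y') = s$.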

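To check that $\hat\beta(y')$ is a lasso solution at $y'$, we consider the function

$g(x) = (X_{\mathcal{E}})^+ \big(x - (X_{\mathcal{E}}^T)^+ \lambda s\big)$.

The continuity of $g$ implies that there exists a neighborhood $U_2$ of $y$ such
that

$\hat\beta_i(y') = [(X_{\mathcal{E}})^+ (y' - (X_{\mathcal{E}}^T)^+ \lambda s)]_i \neq 0$ for $i \in \mathcal{E}$, and

$\mathrm{sign}(\hat\beta_{\mathcal{E}}(y')) = \mathrm{sign}\big((X_{\mathcal{E}})^+ (y' - (X_{\mathcal{E}}^T)^+ \lambda s)\big)$

for each $y' \in U_2$. Defining $U = U_1 \cap U_2$ completes the proof. □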

This immediately implies the following theorem. 

Theorem 1 (Lasso degrees of freedom, equicorrelation set representation).
Assume that $y$ follows a normal distribution (5). For any $X$ and
$\lambda > 0$, the lasso fit $X\hat\beta$ has degrees of freedom

$\mathrm{df}(X\hat\beta) = \mathrm{E}[\mathrm{rank}(X_{\mathcal{E}})]$,

where $\mathcal{E} = \mathcal{E}(y)$ is the equicorrelation set of the lasso fit at $y$.

Proof. By Lemmas 1 and 3 we know that $X\hat\beta(y)$ is continuous and
almost differentiable, so we can use Stein's formula (10) for degrees of
freedom. By Lemma 5, we know that $\mathcal{E} = \mathcal{E}(y)$ and $s = s(y)$ are locally constant
for all $y \notin \mathcal{N}$. Therefore, taking the divergence of the fit in (21), we get

$(\nabla \cdot X\hat\beta)(y) = \mathrm{tr}\big(X_{\mathcal{E}} (X_{\mathcal{E}})^+\big) = \mathrm{rank}(X_{\mathcal{E}})$.

Taking an expectation over $y$ (and recalling that $\mathcal{N}$ has measure zero) gives
the result. □
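The last equality in the divergence calculation rests on a standard fact about pseudoinverses. A short sketch follows; the compact SVD factors $Q$, $\Sigma$, $V$, the rank $k$ and the constant vector $c$ are symbols introduced only for this illustration and are not notation from the paper.

```latex
% Compact SVD of the equicorrelation submatrix, with k = rank(X_E):
X_{\mathcal{E}} = Q \Sigma V^T, \qquad Q^T Q = V^T V = I_k, \qquad \Sigma = \mathrm{diag}(d_1, \ldots, d_k), \; d_i > 0.
% The pseudoinverse and the resulting projection onto col(X_E):
(X_{\mathcal{E}})^+ = V \Sigma^{-1} Q^T
\quad\Longrightarrow\quad
X_{\mathcal{E}} (X_{\mathcal{E}})^+ = Q Q^T.
% With E and s held fixed, the fit (21) is affine in y, with c = (X_E^T)^+ \lambda s constant:
X\hat\beta(y) = Q Q^T (y - c)
\quad\Longrightarrow\quad
(\nabla \cdot X\hat\beta)(y) = \mathrm{tr}(Q Q^T) = \mathrm{tr}(Q^T Q) = \mathrm{tr}(I_k) = k = \mathrm{rank}(X_{\mathcal{E}}).
```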

Next, we shift our focus to a different subset of variables: the active
set $\mathcal{A}$. Unlike the equicorrelation set, the active set is not unique, as it
depends on a particular choice of lasso solution. Though it may seem that
such nonuniqueness could present complications, it turns out that all of the
active sets share a special property; namely, the linear subspace $\mathrm{col}(X_{\mathcal{A}})$ is
the same for any choice of active set $\mathcal{A}$, almost everywhere in $y$. This
invariance allows us to express the degrees of freedom of the lasso fit in terms of the
active set (or, more precisely, any active set).

3.4. The active set. Given a particular solution $\hat\beta$, we define the active
set $\mathcal{A}$ as

(25) $\mathcal{A} = \{ i \in \{1, \ldots, p\} : \hat\beta_i \neq 0 \}$.

This is also called the support of $\hat\beta$ and written $\mathcal{A} = \mathrm{supp}(\hat\beta)$. From (22),
we can see that we always have $\mathcal{A} \subseteq \mathcal{E}$, and different active sets $\mathcal{A}$ can be
formed by choosing $b \in \mathrm{null}(X_{\mathcal{E}})$ to satisfy the sign condition (23) and also

$[(X_{\mathcal{E}})^+ (y - (X_{\mathcal{E}}^T)^+ \lambda s)]_i + b_i = 0$ for $i \notin \mathcal{A}$.

If $\mathrm{rank}(X) = p$, then $b = 0$, so there is a unique active set, and
furthermore $\mathcal{A} = \mathcal{E}$ for almost every $y \in \mathbb{R}^n$ (in particular, this last statement holds
for $y \notin \mathcal{N}$, where $\mathcal{N}$ is the set of measure zero defined in the proof of
Lemma 5). For the signs of the coefficients of active variables, we write

(26) $r = \mathrm{sign}(\hat\beta_{\mathcal{A}})$,

and we note that $r = s_{\mathcal{A}}$.

By similar arguments as those used to derive expression (21) for the fit
in Section 3.2, the lasso fit can also be written as

(27) $X\hat\beta = X_{\mathcal{A}} (X_{\mathcal{A}})^+ \big(y - (X_{\mathcal{A}}^T)^+ \lambda r\big)$

for the active set $\mathcal{A}$ and signs $r$ of any lasso solution $\hat\beta$. If we could take
the divergence of the fit in the expression above, and simply ignore the
dependence of $\mathcal{A}$ and $r$ on $y$ (treat them as constants), then this would give
$(\nabla \cdot X\hat\beta)(y) = \mathrm{rank}(X_{\mathcal{A}})$. In the next section, we show that treating $\mathcal{A}$ and $r$
as constants in (27) is indeed correct, for almost every $y$. This property then
implies that the linear subspace $\mathrm{col}(X_{\mathcal{A}})$ is invariant under any choice of
active set $\mathcal{A}$, almost everywhere in $y$; moreover, it implies that we can write
the lasso degrees of freedom in terms of any active set.

3.5. Degrees of freedom in terms of the active set. We first establish a
result on the local stability of $\mathcal{A}(y)$ and $r(y)$ [written in this way to emphasize
their dependence on $y$, through a solution $\hat\beta(y)$].

Lemma 6. There is a set $\mathcal{N} \subseteq \mathbb{R}^n$, of measure zero, with the following
property: for $y \notin \mathcal{N}$, and for any lasso solution $\hat\beta(y)$ with active set $\mathcal{A}(y)$
and signs $r(y)$, there is a neighborhood $U$ of $y$ such that every point $y' \in U$
yields a lasso solution $\hat\beta(y')$ with the same active set $\mathcal{A}(y') = \mathcal{A}(y)$ and the
same active signs $r(y') = r(y)$.

The proof is similar to that of Lemma 5, except that it is longer and
somewhat more complicated, so it is delayed until Appendix A.3. Combined
with expression (27) for the lasso fit, Lemma 6 now implies an invariance of
the subspace spanned by the active variables.

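Lemma 7. For the same set $\mathcal{N} \subseteq \mathbb{R}^n$ as in Lemma 6, and for any $y \notin \mathcal{N}$,
the linear subspace $\mathrm{col}(X_{\mathcal{A}})$ is invariant under all sets $\mathcal{A} = \mathcal{A}(y)$ defined in
terms of a lasso solution $\hat\beta(y)$ at $y$.

Proof. Let $y \notin \mathcal{N}$, and let $\hat\beta(y)$ be a solution with active set $\mathcal{A} = \mathcal{A}(y)$
and signs $r = r(y)$. Let $U$ be the neighborhood of $y$ as constructed in the
proof of Lemma 6; on this neighborhood, solutions exist with active set $\mathcal{A}$
and signs $r$. Hence, recalling (27), we know that for every $y' \in U$,

$X\hat\beta(y') = X_{\mathcal{A}} (X_{\mathcal{A}})^+ \big(y' - (X_{\mathcal{A}}^T)^+ \lambda r\big)$.

Now suppose that $\mathcal{A}^*$ and $r^*$ are the active set and signs of another lasso
solution at $y$. Then, by the same arguments, there is a neighborhood $U^*$ of $y$
such that

$X\hat\beta(y') = X_{\mathcal{A}^*} (X_{\mathcal{A}^*})^+ \big(y' - (X_{\mathcal{A}^*}^T)^+ \lambda r^*\big)$

for all $y' \in U^*$. By the uniqueness of the fit, we have that for each $y' \in U \cap U^*$,

$X_{\mathcal{A}} (X_{\mathcal{A}})^+ \big(y' - (X_{\mathcal{A}}^T)^+ \lambda r\big) = X_{\mathcal{A}^*} (X_{\mathcal{A}^*})^+ \big(y' - (X_{\mathcal{A}^*}^T)^+ \lambda r^*\big)$.

Since $U \cap U^*$ is open, for any $z \in \mathrm{col}(X_{\mathcal{A}})$, there is an $\varepsilon > 0$ such that
$y + \varepsilon z \in U \cap U^*$. Plugging $y' = y + \varepsilon z$ into the above equation implies that
$z \in \mathrm{col}(X_{\mathcal{A}^*})$, so $\mathrm{col}(X_{\mathcal{A}}) \subseteq \mathrm{col}(X_{\mathcal{A}^*})$. A similar argument gives
$\mathrm{col}(X_{\mathcal{A}^*}) \subseteq \mathrm{col}(X_{\mathcal{A}})$, completing the proof. □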

Again, this immediately leads to the following theorem.

Theorem 2 (Lasso degrees of freedom, active set representation).
Assume that $y$ follows a normal distribution (5). For any $X$ and $\lambda > 0$, the
lasso fit $X\hat\beta$ has degrees of freedom

$\mathrm{df}(X\hat\beta) = \mathrm{E}[\mathrm{rank}(X_{\mathcal{A}})]$,

where $\mathcal{A} = \mathcal{A}(y)$ is the active set corresponding to any lasso solution $\hat\beta(y)$
at $y$.

Note: By Lemma 7, $\mathrm{rank}(X_{\mathcal{A}})$ is an invariant quantity, not depending on
the choice of active set (coming from a lasso solution), for almost every $y$.
This makes the above result well defined.
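As a small illustration of this invariance, the sketch below checks three coefficient vectors against the lasso KKT conditions and then compares rank(X_A) across them. The design matrix, response, value of lambda and all function names are hypothetical, chosen so that the arithmetic is exact; it is a toy check under those assumptions, not the authors' code.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix (list of rows) via exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Look for a pivot at or below the current pivot row.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def is_lasso_solution(X, y, beta, lam):
    """Exact check of the lasso KKT conditions X^T (y - X beta) = lam * gamma."""
    n, p = len(X), len(X[0])
    resid = [Fraction(y[i]) - sum(Fraction(X[i][j]) * beta[j] for j in range(p))
             for i in range(n)]
    for j in range(p):
        g = sum(X[i][j] * resid[i] for i in range(n))
        if beta[j] > 0 and g != lam:
            return False
        if beta[j] < 0 and g != -lam:
            return False
        if beta[j] == 0 and abs(g) > lam:
            return False
    return True

def df_estimate(X, beta):
    """rank(X_A), where A is the support of the supplied solution beta."""
    active = [j for j, b in enumerate(beta) if b != 0]
    if not active:
        return 0
    return matrix_rank([[row[j] for j in active] for row in X])

# Hypothetical design: column 1 duplicates column 0, so rank(X) = 2 < p = 3.
X = [[1, 1, 0],
     [2, 2, 1],
     [0, 0, 3]]
y = [3, 5, Fraction(10, 3)]
lam = 1

# Three solutions with the same fit X beta = (2, 5, 3) but different active sets.
beta_first = [2, 0, 1]
beta_second = [0, 2, 1]
beta_split = [1, 1, 1]

for beta in (beta_first, beta_second, beta_split):
    print(is_lasso_solution(X, y, beta, lam), df_estimate(X, beta))
```

In this toy case the active sets have sizes 2, 2 and 3, yet each gives rank(X_A) = 2, which is the quantity Theorem 2 averages.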

Proof of Theorem 2. We can apply Stein's formula (10) for degrees
of freedom, because $X\hat\beta(y)$ is continuous and almost differentiable by
Lemmas 1 and 3. Let $\mathcal{A} = \mathcal{A}(y)$ and $r = r(y)$ be the active set and active signs
of a lasso solution at $y \notin \mathcal{N}$, with $\mathcal{N}$ as in Lemma 6. By this same lemma,
there exists a lasso solution with active set $\mathcal{A}$ and signs $r$ at every point $y'$
in some neighborhood $U$ of $y$, and therefore, taking the divergence of the
fit (27), we get

$(\nabla \cdot X\hat\beta)(y) = \mathrm{tr}\big(X_{\mathcal{A}} (X_{\mathcal{A}})^+\big) = \mathrm{rank}(X_{\mathcal{A}})$.

Taking an expectation over $y$ completes the proof. □

Remark (Equicorrelation set representation). The proof of Lemma 6
showed that, for almost every $y$, the equicorrelation set $\mathcal{E}$ is actually the
active set $\mathcal{A}$ of the particular lasso solution defined in (24). Hence Theorem 1
can be viewed as a corollary of Theorem 2.

Remark (Full column rank X). When $\mathrm{rank}(X) = p$, the lasso solution
is unique, and there is only one active set $\mathcal{A}$. And as the columns of $X$ are
linearly independent, we have $\mathrm{rank}(X_{\mathcal{A}}) = |\mathcal{A}|$, so the result of Theorem 2
reduces to

$\mathrm{df}(X\hat\beta) = \mathrm{E}|\mathcal{A}|$,

as shown in Zou, Hastie and Tibshirani (2007).

Remark (The smallest active set). An interesting result on the lasso
degrees of freedom was recently and independently obtained by Dossal et al.
(2011). Their result states that, for a general $X$,

$\mathrm{df}(X\hat\beta) = \mathrm{E}|\mathcal{A}^*|$,

where $|\mathcal{A}^*|$ is the smallest cardinality among all active sets of lasso solutions.
This actually follows from Theorem 2, by noting that for any $y$ there
exists a lasso solution whose active set $\mathcal{A}^*$ corresponds to linearly independent
predictors $X_{\mathcal{A}^*}$, so $\mathrm{rank}(X_{\mathcal{A}^*}) = |\mathcal{A}^*|$ [e.g., see Theorem 3 in Appendix B
of Rosset, Zhu and Hastie (2004)], and furthermore, for almost every $y$ no
active set can have a cardinality smaller than $|\mathcal{A}^*|$, as this would contradict
Lemma 7.



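Remark (The elastic net). Consider the elastic net problem [Zou and
Hastie (2005)],

(28) $\hat\beta = \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \frac{\lambda_2}{2} \|\beta\|_2^2$,

where we now have two tuning parameters $\lambda_1, \lambda_2 > 0$. Note that our notation
above emphasizes the fact that there is always a unique solution to the elastic
net criterion, regardless of the rank of $X$. This property (among others, such
as stability and predictive ability) is considered an advantage of the elastic
net over the lasso. We can rewrite the elastic net problem (28) as a (full
column rank) lasso problem,

$\hat\beta = \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \begin{bmatrix} y \\ 0 \end{bmatrix} - \begin{bmatrix} X \\ \sqrt{\lambda_2}\, I \end{bmatrix} \beta \right\|_2^2 + \lambda_1 \|\beta\|_1$,

and hence it can be shown (although we omit the details) that the degrees
of freedom of the elastic net fit is

$\mathrm{df}(X\hat\beta) = \mathrm{E}\big[\mathrm{tr}\big(X_{\mathcal{A}} (X_{\mathcal{A}}^T X_{\mathcal{A}} + \lambda_2 I)^{-1} X_{\mathcal{A}}^T\big)\big]$,

where $\mathcal{A} = \mathcal{A}(y)$ is the active set of the elastic net solution at $y$.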
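To see how the ridge term changes the count, the trace in the elastic net formula can be written in terms of singular values. The following is a sketch of that standard calculation; the compact SVD factors $Q$, $\Sigma$, $V$ and the rank $k$ are symbols introduced here for illustration only.

```latex
% Compact SVD of the active submatrix, with k = rank(X_A) and singular values d_1, ..., d_k > 0:
X_{\mathcal{A}} = Q \Sigma V^T, \qquad Q^T Q = V^T V = I_k.
% Since (V \Sigma^2 V^T + \lambda_2 I) V = V (\Sigma^2 + \lambda_2 I), we get
V^T (X_{\mathcal{A}}^T X_{\mathcal{A}} + \lambda_2 I)^{-1} V = (\Sigma^2 + \lambda_2 I)^{-1},
% and therefore
\mathrm{tr}\big( X_{\mathcal{A}} (X_{\mathcal{A}}^T X_{\mathcal{A}} + \lambda_2 I)^{-1} X_{\mathcal{A}}^T \big)
  = \mathrm{tr}\big( \Sigma (\Sigma^2 + \lambda_2 I)^{-1} \Sigma \big)
  = \sum_{i=1}^{k} \frac{d_i^2}{d_i^2 + \lambda_2}.
% Each summand lies in (0, 1) when \lambda_2 > 0, so the trace is below rank(X_A);
% letting \lambda_2 \to 0 recovers rank(X_A), matching Theorem 2.
```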

Remark (The lasso with intercept). It is often more appropriate to
include an (unpenalized) intercept coefficient in the lasso model, yielding
the problem

(29) $(\hat\beta_0, \hat\beta) \in \mathrm{argmin}_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} \; \frac{1}{2} \|y - \beta_0 \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_1$,

where $\mathbf{1} = (1, 1, \ldots, 1) \in \mathbb{R}^n$ is the vector of all 1s. Defining $M = I - \mathbf{1}\mathbf{1}^T / n \in \mathbb{R}^{n \times n}$,
we note that the fit of problem (29) can be written as $\hat\beta_0 \mathbf{1} + X\hat\beta = (I - M) y + M X \hat\beta$,
and that $\hat\beta$ solves the usual lasso problem

$\hat\beta \in \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \|My - MX\beta\|_2^2 + \lambda \|\beta\|_1$.

Now it follows (again we omit the details) that the fit of the lasso problem
with intercept (29) has degrees of freedom

$\mathrm{df}(\hat\beta_0 \mathbf{1} + X\hat\beta) = 1 + \mathrm{E}[\mathrm{rank}(M X_{\mathcal{A}})]$,

where $\mathcal{A} = \mathcal{A}(y)$ is the active set of a solution $\hat\beta(y)$ at $y$ (these are the
nonintercept coefficients). In other words, the degrees of freedom is one plus
the expected dimension of the subspace spanned by the active variables, once
we have centered these variables. A similar result holds for an arbitrary set of
unpenalized coefficients, by replacing $M$ above with the projection onto the
orthogonal complement of the column space of the unpenalized variables,
and 1 above with the dimension of the column space of the unpenalized
variables.
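The effect of centering on the count can be seen in a small case. The sketch below computes 1 + rank(M X_A) for a hypothetical active submatrix whose two columns differ by a constant shift; the data and function names are illustrative assumptions, and a nonempty active set is assumed.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix (list of rows) via exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def center_columns(X):
    """Apply M = I - 11^T / n, i.e. subtract each column's mean."""
    n, p = len(X), len(X[0])
    means = [sum(Fraction(row[j]) for row in X) / n for j in range(p)]
    return [[Fraction(row[j]) - means[j] for j in range(p)] for row in X]

def df_estimate_with_intercept(X_active):
    """1 + rank(M X_A) for a nonempty active submatrix X_A."""
    return 1 + matrix_rank(center_columns(X_active))

# Hypothetical active columns: the second equals the first plus the constant 3.
X_active = [[1, 4],
            [2, 5],
            [4, 7]]

print(matrix_rank(X_active))                 # rank before centering
print(df_estimate_with_intercept(X_active))  # one plus rank after centering
```

Before centering the two columns are linearly independent, but after centering they coincide, so the estimate counts the intercept plus a single direction.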

As mentioned in the Introduction, a nice feature of the full column rank
result (2) is its interpretability and its explicit nature. The general result is
also explicit in the sense that an unbiased estimate of degrees of freedom
can be achieved by computing the rank of a given matrix. In terms of
interpretability, when $\mathrm{rank}(X) = p$, the degrees of freedom of the lasso fit is
$\mathrm{E}|\mathcal{A}|$; this says that, on average, the lasso "spends" the same number of
parameters as does linear regression on $|\mathcal{A}|$ linearly independent predictor
variables. Fortunately, a similar interpretation is possible in the general case:
we showed in Theorem 2 that for a general predictor matrix $X$, the degrees
of freedom of the lasso fit is $\mathrm{E}[\mathrm{rank}(X_{\mathcal{A}})]$, the expected dimension of the
linear subspace spanned by the active variables. Meanwhile, for the linear
regression problem

(30) $\hat\beta_{\mathcal{A}} = \mathrm{argmin}_{\beta_{\mathcal{A}} \in \mathbb{R}^{|\mathcal{A}|}} \; \|y - X_{\mathcal{A}} \beta_{\mathcal{A}}\|_2^2$,

where we consider $\mathcal{A}$ fixed, the degrees of freedom of the fit is
$\mathrm{tr}(X_{\mathcal{A}} (X_{\mathcal{A}})^+) = \mathrm{rank}(X_{\mathcal{A}})$. In other words, the lasso adaptively selects a subset $\mathcal{A}$ of the
variables to use for a linear model of $y$, but on average it only "spends" the same
number of parameters as would linear regression on the variables in $\mathcal{A}$, if $\mathcal{A}$
were pre-specified.

How is this possible? Broadly speaking, the answer lies in the shrinkage
due to the penalty. Although the active set is chosen adaptively, the lasso
does not estimate the active coefficients as aggressively as does the
corresponding linear regression problem (30); instead, they are shrunken toward
zero, and this adjusts for the adaptive selection. Differing views have been
presented in the literature with respect to this feature of lasso shrinkage. On
the one hand, for example, Fan and Li (2001) point out that lasso estimates
suffer from bias due to the shrinkage of large coefficients, and motivate the
nonconvex SCAD penalty as an attempt to overcome this bias. On the other
hand, for example, Loubes and Massart (2004) discuss the merits of such
shrunken estimates in model selection criteria, such as (9). In the current
context, the shrinkage due to the $\ell_1$ penalty is helpful in that it provides
control over degrees of freedom. A more precise study of this idea is the
topic of future work.

4. The generalized lasso. In this section we extend our degrees of
freedom results to the generalized lasso problem, with an arbitrary predictor
matrix $X$ and penalty matrix $D$. As before, the KKT conditions play a
central role, and we present these first. Also, many results that follow have
equivalent derivations from the perspective of the generalized lasso dual
problem; see Appendix A.5. We remind the reader that $D_R$ is used to
extract rows of $D$ corresponding to an index set $R$.

4.1. The KKT conditions and the underlying polyhedron. The KKT
conditions for the generalized lasso problem (3) are

(31) $X^T (y - X\hat\beta) = D^T \lambda \gamma$,

(32) $\gamma_i \in \{\mathrm{sign}((D\hat\beta)_i)\}$ if $(D\hat\beta)_i \neq 0$, and $\gamma_i \in [-1, 1]$ if $(D\hat\beta)_i = 0$.

Now $\gamma \in \mathbb{R}^m$ is a subgradient of the function $f(x) = \|x\|_1$ evaluated at
$x = D\hat\beta$. Similar to what we showed for the lasso, it follows from the KKT
conditions that the generalized lasso fit is the residual from projecting $y$
onto a polyhedron.

Lemma 8. For any $X$ and $\lambda > 0$, the generalized lasso fit can be written
as $X\hat\beta(y) = (I - P_C)(y)$, where $C \subseteq \mathbb{R}^n$ is the polyhedron

$C = \{ u \in \mathbb{R}^n : X^T u = D^T w \text{ for } w \in \mathbb{R}^m, \; \|w\|_\infty \le \lambda \}$.

Proof. The proof is quite similar to that of Lemma 3. As in (16), we
want to show that

(33) $\langle X\hat\beta, y - X\hat\beta \rangle - \langle X^T u, \hat\beta \rangle \ge 0$

for all $u \in C$, where $C$ is as in the lemma. For the first term above, we can
take an inner product with $\hat\beta$ on both sides of (31) to get
$\langle X\hat\beta, y - X\hat\beta \rangle = \lambda \|D\hat\beta\|_1$, and furthermore,

$\lambda \|D\hat\beta\|_1 = \max_{\|w\|_\infty \le \lambda} \langle w, D\hat\beta \rangle = \max_{\|w\|_\infty \le \lambda} \langle D^T w, \hat\beta \rangle$.

Therefore (33) holds if $X^T u = D^T w$ for some $\|w\|_\infty \le \lambda$, in other words, if
$u \in C$. To show that $C$ is a polyhedron, note that we can write it as
$C = (X^T)^{-1}(D^T(B))$, where $(X^T)^{-1}$ is taken to mean the inverse image under
the linear map $X^T$, and $B = \{ w \in \mathbb{R}^m : \|w\|_\infty \le \lambda \}$, a hypercube in $\mathbb{R}^m$.
Clearly $B$ is a polyhedron, and the image or inverse image of a polyhedron
under a linear map is still a polyhedron. □

As with the lasso, this lemma implies that the generalized lasso fit $X\hat\beta(y)$
is nonexpansive, and therefore continuous and almost differentiable as a
function of $y$, by Lemma 1. This is important because it allows us to use Stein's
formula when computing degrees of freedom.
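The simplest instance of Lemma 8 takes X = I and D = I, in which case the polyhedron C is the box of radius lambda and the fit is the familiar soft-thresholding of y. The sketch below checks this numerically and estimates the divergence by finite differences; the function names, data and step size are illustrative assumptions rather than anything from the paper.

```python
def project_onto_box(y, lam):
    """Projection onto C = {u : max_i |u_i| <= lam}, which is C when X = D = I."""
    return [max(-lam, min(lam, v)) for v in y]

def fit_as_residual(y, lam):
    """The fit (I - P_C)(y): the residual from projecting y onto the box."""
    return [v - c for v, c in zip(y, project_onto_box(y, lam))]

def soft_threshold(y, lam):
    """Closed-form minimizer of (1/2)||y - b||^2 + lam * ||b||_1."""
    return [(abs(v) - lam) * (1 if v > 0 else -1) if abs(v) > lam else 0.0
            for v in y]

def divergence(y, lam, h=0.125):
    """Finite-difference estimate of sum_i d(fit_i)/d(y_i)."""
    base = fit_as_residual(y, lam)
    total = 0.0
    for i in range(len(y)):
        shifted = list(y)
        shifted[i] += h
        total += (fit_as_residual(shifted, lam)[i] - base[i]) / h
    return total

# Hypothetical response, chosen away from the kinks at +/- lam.
y = [3.0, -0.5, 0.25, -2.5]
lam = 1.0

fit = fit_as_residual(y, lam)
active = [i for i, v in enumerate(fit) if v != 0]
print(fit, soft_threshold(y, lam))
print(divergence(y, lam), len(active))
```

For this y the two constructions of the fit agree, and the numerical divergence equals the number of nonzero coefficients, as the degrees of freedom results predict when X = D = I.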

In the next section we define the boundary set $\mathcal{B}$, and derive expressions
for the generalized lasso fit and solutions in terms of $\mathcal{B}$. The following section
defines the active set $\mathcal{A}$ in the generalized lasso context, and again gives
expressions for the fit and solutions in terms of $\mathcal{A}$. Though neither $\mathcal{B}$ nor $\mathcal{A}$
is necessarily unique for the generalized lasso problem, any choice of $\mathcal{B}$ or $\mathcal{A}$
generates a special invariant subspace (similar to the case for the active sets
in the lasso problem). We are subsequently able to express the degrees of
freedom of the generalized lasso fit in terms of any boundary set $\mathcal{B}$, or any
active set $\mathcal{A}$.

4.2. The boundary set. Like the lasso, the generalized lasso fit $X\hat\beta$ is
always unique (following from Lemma 8, and the fact that projection onto
a closed convex set is unique). However, unlike the lasso, the optimal
subgradient $\gamma$ in the generalized lasso problem is not necessarily unique. In
particular, if $\mathrm{rank}(D) < m$, then the optimal subgradient $\gamma$ is not uniquely
determined by conditions (31) and (32). Given a subgradient $\gamma$ satisfying (31)
and (32) for some $\hat\beta$, we define the boundary set $\mathcal{B}$ as

$\mathcal{B} = \{ i \in \{1, \ldots, m\} : |\gamma_i| = 1 \}$.

This generalizes the notion of the equicorrelation set $\mathcal{E}$ in the lasso problem
[though, as just noted, the set $\mathcal{B}$ is not necessarily unique unless $\mathrm{rank}(D) = m$].
We also define

$s = \gamma_{\mathcal{B}}$.

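Now we focus on writing the generalized lasso fit and solutions in terms
of $\mathcal{B}$ and $s$. Abbreviating $P = P_{\mathrm{null}(D_{-\mathcal{B}})}$, note that we can expand
$P D^T \lambda \gamma = P D_{\mathcal{B}}^T \lambda s + P D_{-\mathcal{B}}^T \lambda \gamma_{-\mathcal{B}} = P D_{\mathcal{B}}^T \lambda s$. Therefore, multiplying both sides of (31)
by $P$ yields

(34) $P X^T (y - X\hat\beta) = P D_{\mathcal{B}}^T \lambda s$.

Since $P D_{\mathcal{B}}^T \lambda s \in \mathrm{col}(P X^T)$, we can write
$P D_{\mathcal{B}}^T \lambda s = (P X^T)(P X^T)^+ P D_{\mathcal{B}}^T \lambda s = (P X^T)(P X^T)^+ D_{\mathcal{B}}^T \lambda s$.
Also, we have $D_{-\mathcal{B}} \hat\beta = 0$ by definition of $\mathcal{B}$, so $P\hat\beta = \hat\beta$.
These two facts allow us to rewrite (34) as

$P X^T X P \hat\beta = P X^T \big(y - (P X^T)^+ D_{\mathcal{B}}^T \lambda s\big)$,

and hence the fit $X\hat\beta = X P \hat\beta$ is

(35) $X\hat\beta = (X P_{\mathrm{null}(D_{-\mathcal{B}})}) (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big(y - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s\big)$,

where we have un-abbreviated $P = P_{\mathrm{null}(D_{-\mathcal{B}})}$. Further, any generalized lasso
solution is of the form

(36) $\hat\beta = (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big(y - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s\big) + b$,

where $b \in \mathrm{null}(X P_{\mathrm{null}(D_{-\mathcal{B}})})$. Multiplying the above equation by $D_{-\mathcal{B}}$, and
recalling that $D_{-\mathcal{B}} \hat\beta = 0$, reveals that $b \in \mathrm{null}(D_{-\mathcal{B}})$; hence
$b \in \mathrm{null}(X P_{\mathrm{null}(D_{-\mathcal{B}})}) \cap \mathrm{null}(D_{-\mathcal{B}}) = \mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}})$.
In the case that $\mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}}) = \{0\}$, the generalized lasso solution is
unique and is given by (36) with $b = 0$. This occurs when $\mathrm{rank}(X) = p$, for
example. Otherwise, any $b \in \mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}})$ gives a generalized lasso
solution in (36) as long as it also satisfies the sign condition

(37) $\mathrm{sign}\big(D_i (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ (y - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s) + D_i b\big) = s_i$
for each $i \in \mathcal{B}$ such that $D_i (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ (y - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s) + D_i b \neq 0$,

necessary to ensure that $\gamma$ is a proper subgradient of $\|D\hat\beta\|_1$.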

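4.3. The active set. We define the active set of a particular solution $\hat\beta$ as

$\mathcal{A} = \{ i \in \{1, \ldots, m\} : (D\hat\beta)_i \neq 0 \}$,

which can be alternatively expressed as $\mathcal{A} = \mathrm{supp}(D\hat\beta)$. If $\hat\beta$ corresponds
to a subgradient with boundary set $\mathcal{B}$ and signs $s$, then $\mathcal{A} \subseteq \mathcal{B}$; in
particular, given $\mathcal{B}$ and $s$, different active sets $\mathcal{A}$ can be generated by taking
$b \in \mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}})$ such that (37) is satisfied, and also

$D_i (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big(y - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s\big) + D_i b = 0$ for $i \in \mathcal{B} \setminus \mathcal{A}$.

If $\mathrm{rank}(X) = p$, then $b = 0$, and there is only one active set $\mathcal{A}$; however, in
this case, $\mathcal{A}$ can still be a strict subset of $\mathcal{B}$. This is quite different from
the lasso problem, wherein $\mathcal{A} = \mathcal{E}$ for almost every $y$ whenever $\mathrm{rank}(X) = p$.
[Note that in the generalized lasso problem, $\mathrm{rank}(X) = p$ implies that $\mathcal{A}$ is
unique but implies nothing about the uniqueness of $\mathcal{B}$; this is determined by
the rank of $D$. The boundary set $\mathcal{B}$ is not necessarily unique if $\mathrm{rank}(D) < m$,
and in this case we may have $D_i (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ = 0$ for some $i \in \mathcal{B}$, which
certainly implies that $i \notin \mathcal{A}$ for any $y \in \mathbb{R}^n$. Hence some boundary sets may
not correspond to active sets at any $y$.] We denote the signs of the active
entries in $D\hat\beta$ by

$r = \mathrm{sign}(D_{\mathcal{A}} \hat\beta)$,

and we note that $r = s_{\mathcal{A}}$.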

Following the same arguments as those leading up to the expression for
the fit (35) in Section 4.2, we can alternatively express the generalized lasso
fit as

(38) $X\hat\beta = (X P_{\mathrm{null}(D_{-\mathcal{A}})}) (X P_{\mathrm{null}(D_{-\mathcal{A}})})^+ \big(y - (P_{\mathrm{null}(D_{-\mathcal{A}})} X^T)^+ D_{\mathcal{A}}^T \lambda r\big)$,

where $\mathcal{A}$ and $r$ are the active set and signs of any solution. Computing
the divergence of the fit in (38), and pretending that $\mathcal{A}$ and $r$ are
constants (not depending on $y$), gives
$(\nabla \cdot X\hat\beta)(y) = \dim(\mathrm{col}(X P_{\mathrm{null}(D_{-\mathcal{A}})})) = \dim(X(\mathrm{null}(D_{-\mathcal{A}})))$.
The same logic applied to (35) gives $(\nabla \cdot X\hat\beta)(y) = \dim(X(\mathrm{null}(D_{-\mathcal{B}})))$.
The next section shows that, for almost every $y$, the
quantities $\mathcal{A}, r$ or $\mathcal{B}, s$ can indeed be treated as locally constant in
expressions (38) or (35), respectively. We then prove that the linear subspaces
$X(\mathrm{null}(D_{-\mathcal{B}}))$, $X(\mathrm{null}(D_{-\mathcal{A}}))$ are invariant under all choices of boundary
sets $\mathcal{B}$, respectively active sets $\mathcal{A}$, and that the two subspaces are in fact
equal, for almost every $y$. Furthermore, we express the generalized lasso
degrees of freedom in terms of any boundary set or any active set.

4.4. Degrees of freedom. We call $(\gamma(y), \hat\beta(y))$ an optimal pair provided
that $\gamma(y)$ and $\hat\beta(y)$ jointly satisfy the KKT conditions, (31) and (32), at $y$.
For such a pair, we consider its boundary set $\mathcal{B}(y)$, boundary signs $s(y)$,
active set $\mathcal{A}(y)$, active signs $r(y)$, and show that these sets and sign vectors
possess a kind of local stability.

Lemma 9. There exists a set $\mathcal{N} \subseteq \mathbb{R}^n$, of measure zero, with the
following property: for $y \notin \mathcal{N}$, and for any optimal pair $(\gamma(y), \hat\beta(y))$ with boundary
set $\mathcal{B}(y)$, boundary signs $s(y)$, active set $\mathcal{A}(y)$, and active signs $r(y)$, there
is a neighborhood $U$ of $y$ such that each point $y' \in U$ yields an optimal
pair $(\gamma(y'), \hat\beta(y'))$ with the same boundary set $\mathcal{B}(y') = \mathcal{B}(y)$, boundary signs
$s(y') = s(y)$, active set $\mathcal{A}(y') = \mathcal{A}(y)$ and active signs $r(y') = r(y)$.

The proof is delayed to Appendix A.4, mainly because of its length. Now
Lemma 9, used together with expressions (35) and (38) for the generalized
lasso fit, implies an invariance in representing a (particularly important)
linear subspace.

Lemma 10. For the same set $\mathcal{N} \subseteq \mathbb{R}^n$ as in Lemma 9, and for any
$y \notin \mathcal{N}$, the linear subspace $L = X(\mathrm{null}(D_{-\mathcal{B}}))$ is invariant under all boundary
sets $\mathcal{B} = \mathcal{B}(y)$ defined in terms of an optimal subgradient $\gamma(y)$ at $y$. The
linear subspace $L' = X(\mathrm{null}(D_{-\mathcal{A}}))$ is also invariant under all choices of
active sets $\mathcal{A} = \mathcal{A}(y)$ defined in terms of a generalized lasso solution $\hat\beta(y)$
at $y$. Finally, the two subspaces are equal, $L = L'$.

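Proof. Let $y \notin \mathcal{N}$, and let $\gamma(y)$ be an optimal subgradient with
boundary set $\mathcal{B} = \mathcal{B}(y)$ and signs $s = s(y)$. Let $U$ be the neighborhood of $y$ over
which optimal subgradients exist with boundary set $\mathcal{B}$ and signs $s$, as given
by Lemma 9. Recalling the expression for the fit (35), we have that for every
$y' \in U$,

$X\hat\beta(y') = (X P_{\mathrm{null}(D_{-\mathcal{B}})}) (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big(y' - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s\big)$.

If $\hat\beta(y)$ is a solution with active set $\mathcal{A} = \mathcal{A}(y)$ and signs $r = r(y)$, then again
by Lemma 9 there is a neighborhood $V$ of $y$ such that each point $y' \in V$
yields a solution with active set $\mathcal{A}$ and signs $r$. [Note that $V$ and $U$ are not
necessarily equal unless $\gamma(y)$ and $\hat\beta(y)$ jointly satisfy the KKT conditions
at $y$.] Therefore, recalling (38), we have

$X\hat\beta(y') = (X P_{\mathrm{null}(D_{-\mathcal{A}})}) (X P_{\mathrm{null}(D_{-\mathcal{A}})})^+ \big(y' - (P_{\mathrm{null}(D_{-\mathcal{A}})} X^T)^+ D_{\mathcal{A}}^T \lambda r\big)$

for each $y' \in V$. The uniqueness of the generalized lasso fit now implies that

$(X P_{\mathrm{null}(D_{-\mathcal{B}})}) (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big(y' - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s\big) = (X P_{\mathrm{null}(D_{-\mathcal{A}})}) (X P_{\mathrm{null}(D_{-\mathcal{A}})})^+ \big(y' - (P_{\mathrm{null}(D_{-\mathcal{A}})} X^T)^+ D_{\mathcal{A}}^T \lambda r\big)$

for all $y' \in U \cap V$. As $U \cap V$ is open, for any $z \in \mathrm{col}(X P_{\mathrm{null}(D_{-\mathcal{B}})})$, there
exists an $\varepsilon > 0$ such that $y + \varepsilon z \in U \cap V$. Plugging $y' = y + \varepsilon z$ into the
equation above reveals that $z \in \mathrm{col}(X P_{\mathrm{null}(D_{-\mathcal{A}})})$, hence
$\mathrm{col}(X P_{\mathrm{null}(D_{-\mathcal{B}})}) \subseteq \mathrm{col}(X P_{\mathrm{null}(D_{-\mathcal{A}})})$. The reverse inclusion follows similarly, and therefore
$\mathrm{col}(X P_{\mathrm{null}(D_{-\mathcal{B}})}) = \mathrm{col}(X P_{\mathrm{null}(D_{-\mathcal{A}})})$. Finally, the same strategy can be used
to show that these linear subspaces are unchanged for any choice of
boundary set $\mathcal{B} = \mathcal{B}(y)$ coming from an optimal subgradient at $y$, and for any
choice of active set $\mathcal{A} = \mathcal{A}(y)$ coming from a solution at $y$. Noticing that
$\mathrm{col}(M P_{\mathrm{null}(N)}) = M(\mathrm{null}(N))$ for matrices $M, N$ gives the result as stated in
the lemma. □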

This local stability result implies the following theorem. 

Theorem 3 (Generalized lasso degrees of freedom). Assume that $y$
follows a normal distribution (5). For any $X$, $D$ and $\lambda > 0$, the degrees of
freedom of the generalized lasso fit can be expressed as

$\mathrm{df}(X\hat\beta) = \mathrm{E}[\dim(X(\mathrm{null}(D_{-\mathcal{B}})))]$,

where $\mathcal{B} = \mathcal{B}(y)$ is the boundary set corresponding to any optimal
subgradient $\gamma(y)$ of the generalized lasso problem at $y$. We can alternatively express
degrees of freedom as

$\mathrm{df}(X\hat\beta) = \mathrm{E}[\dim(X(\mathrm{null}(D_{-\mathcal{A}})))]$,

with $\mathcal{A} = \mathcal{A}(y)$ being the active set corresponding to any generalized lasso
solution $\hat\beta(y)$ at $y$.

Note: Lemma 10 implies that for almost every $y \in \mathbb{R}^n$, for any $\mathcal{B}$ defined in
terms of an optimal subgradient, and for any $\mathcal{A}$ defined in terms of a
generalized lasso solution, $\dim(X(\mathrm{null}(D_{-\mathcal{B}}))) = \dim(X(\mathrm{null}(D_{-\mathcal{A}})))$. This makes
the above expressions for degrees of freedom well defined.

Proof of Theorem 3. First, the continuity and almost differentiability
of $X\hat\beta(y)$ follow from Lemmas 1 and 8, so we can use Stein's formula (10)
for degrees of freedom. Let $y \notin \mathcal{N}$, where $\mathcal{N}$ is the set of measure zero as
in Lemma 9. If $\mathcal{B} = \mathcal{B}(y)$ and $s = s(y)$ are the boundary set and signs of an
optimal subgradient at $y$, then by Lemma 9 there is a neighborhood $U$ of $y$
such that each point $y' \in U$ yields an optimal subgradient with boundary
set $\mathcal{B}$ and signs $s$. Therefore, taking the divergence of the fit in (35),

$(\nabla \cdot X\hat\beta)(y) = \mathrm{tr}\big(P_{X(\mathrm{null}(D_{-\mathcal{B}}))}\big) = \dim(X(\mathrm{null}(D_{-\mathcal{B}})))$,

and taking an expectation over $y$ gives the first expression in the theorem.

Similarly, if $\mathcal{A} = \mathcal{A}(y)$ and $r = r(y)$ are the active set and signs of a
generalized lasso solution at $y$, then by Lemma 9 there exists a solution with
active set $\mathcal{A}$ and signs $r$ at each point $y'$ in some neighborhood $V$ of $y$. The
divergence of the fit in (38) is hence

$(\nabla \cdot X\hat\beta)(y) = \mathrm{tr}\big(P_{X(\mathrm{null}(D_{-\mathcal{A}}))}\big) = \dim(X(\mathrm{null}(D_{-\mathcal{A}})))$,

and taking an expectation over $y$ gives the second expression. □

Remark (Full column rank X). If $\mathrm{rank}(X) = p$, then $\dim(X(L)) = \dim(L)$
for any linear subspace $L$, so the results of Theorem 3 reduce to

$\mathrm{df}(X\hat\beta) = \mathrm{E}[\mathrm{nullity}(D_{-\mathcal{B}})] = \mathrm{E}[\mathrm{nullity}(D_{-\mathcal{A}})]$.

The first equality above was shown in Tibshirani and Taylor (2011).
Analyzing the null space of $D_{-\mathcal{B}}$ (equivalently, $D_{-\mathcal{A}}$) for specific choices of $D$
then gives interpretable results on the degrees of freedom of the fused lasso
and trend filtering fits as mentioned in the Introduction. It is important to
note that, as $\mathrm{rank}(X) = p$, the active set $\mathcal{A}$ is unique, but not necessarily
equal to the boundary set $\mathcal{B}$ [since $\mathcal{B}$ can be nonunique if $\mathrm{rank}(D) < m$].
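As a worked instance of analyzing the null space, consider the one-dimensional fused lasso. The sketch below assumes a particular convention that the text above does not fix: $D$ is the $(n-1) \times n$ first-difference matrix, $p = n$, and $\mathrm{rank}(X) = p$.

```latex
% Assumed first-difference penalty and the resulting active set (the breakpoints of the solution):
(D\beta)_i = \beta_{i+1} - \beta_i, \quad i = 1, \ldots, n-1,
\qquad
\mathcal{A} = \{ i : \hat\beta_{i+1} \neq \hat\beta_i \}.
% Rows of D outside A force neighboring coordinates to agree:
\mathrm{null}(D_{-\mathcal{A}}) = \{ \beta \in \mathbb{R}^n : \beta_{i+1} = \beta_i \ \text{for all } i \notin \mathcal{A} \}.
% These are the vectors that are constant on each of the |A| + 1 blocks cut out by the breakpoints, so
\mathrm{nullity}(D_{-\mathcal{A}}) = |\mathcal{A}| + 1
\quad\Longrightarrow\quad
\mathrm{df}(X\hat\beta) = \mathrm{E}\big[\, |\mathcal{A}| + 1 \,\big],
% that is, the expected number of constant (fused) blocks in the solution.
```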

Remark (The lasso). If $D = I$, then $X(\mathrm{null}(D_{-S})) = \mathrm{col}(X_S)$ for any
subset $S \subseteq \{1, \ldots, p\}$. Therefore the results of Theorem 3 become

$\mathrm{df}(X\hat\beta) = \mathrm{E}[\mathrm{rank}(X_{\mathcal{B}})] = \mathrm{E}[\mathrm{rank}(X_{\mathcal{A}})]$,

which match the results of Theorems 1 and 2 (recall that for the lasso the
boundary set $\mathcal{B}$ is exactly the same as the equicorrelation set $\mathcal{E}$).

Remark (The smallest active set). Recent and independent work of
Vaiter et al. (2011) shows that, for arbitrary $X$, $D$ and for any $y$, there
exists a generalized lasso solution whose active set $\mathcal{A}^*$ satisfies

$\mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{A}^*}) = \{0\}$.

(Calling $\mathcal{A}^*$ the "smallest" active set is somewhat of an abuse of terminology,
but it is the smallest in terms of the above intersection.) The authors then
prove that, for any $X$, $D$, the generalized lasso fit has degrees of freedom

$\mathrm{df}(X\hat\beta) = \mathrm{E}[\mathrm{nullity}(D_{-\mathcal{A}^*})]$,

with $\mathcal{A}^*$ the special active set as above. This matches the active set result
of Theorem 3 applied to $\mathcal{A}^*$, since $\dim(X(\mathrm{null}(D_{-\mathcal{A}^*}))) = \mathrm{nullity}(D_{-\mathcal{A}^*})$ for
this special active set.

We conclude this section by comparing the active set result of Theorem 3
to degrees of freedom in a particularly relevant equality constrained linear
regression problem (this comparison is similar to that made in the lasso case,
given at the end of Section 3). The result states that the generalized lasso fit
has degrees of freedom $\mathrm{E}[\dim(X(\mathrm{null}(D_{-\mathcal{A}})))]$, where $\mathcal{A} = \mathcal{A}(y)$ is the active
set of a generalized lasso solution at $y$. In other words, the complement of $\mathcal{A}$
gives the rows of $D$ that are orthogonal to some generalized lasso solution.
Now, consider the equality constrained linear regression problem

(39) $\hat\beta \in \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; \|y - X\beta\|_2^2$ subject to $D_{-\mathcal{A}} \beta = 0$,

in which the set $\mathcal{A}$ is fixed. It is straightforward to verify that the fit of
this problem is the projection map onto $\mathrm{col}(X P_{\mathrm{null}(D_{-\mathcal{A}})}) = X(\mathrm{null}(D_{-\mathcal{A}}))$,
and hence has degrees of freedom $\dim(X(\mathrm{null}(D_{-\mathcal{A}})))$. This means that the
generalized lasso fits a linear model of $y$, and simultaneously makes the
coefficients orthogonal to an adaptive subset of the rows of $D$ (those outside $\mathcal{A}$), but on average
it only uses the same number of parameters as does the corresponding
equality constrained linear regression problem (39), in which $\mathcal{A}$ is pre-specified.

This seemingly paradoxical statement can be explained by the shrinkage
due to the $\ell_1$ penalty. Even though the active set $\mathcal{A}$ is chosen adaptively
based on $y$, the generalized lasso does not estimate the coefficients as
aggressively as does the equality constrained linear regression problem (39),
but rather, it shrinks them toward zero. Roughly speaking, this shrinkage can
be viewed as a "deficit" in degrees of freedom, which makes up for the
"surplus" attributed to the adaptive selection. We study this idea more precisely
in a future paper.

5. Discussion. We showed that the degrees of freedom of the lasso fit, for
an arbitrary predictor matrix $X$, is equal to $\mathrm{E}[\mathrm{rank}(X_{\mathcal{A}})]$. Here $\mathcal{A} = \mathcal{A}(y)$
is the active set of any lasso solution at $y$, that is, $\mathcal{A}(y) = \mathrm{supp}(\hat\beta(y))$. This
result is well defined, since we proved that any active set $\mathcal{A}$ generates the
same linear subspace $\mathrm{col}(X_{\mathcal{A}})$, almost everywhere in $y$. In fact, we showed
that for almost every $y$, and for any active set $\mathcal{A}$ of a solution at $y$, the lasso
fit can be written as

$X\hat\beta(y') = P_{\mathrm{col}(X_{\mathcal{A}})}(y') + c$

for all $y'$ in a neighborhood of $y$, where $c$ is a constant (it does not depend
on $y'$). This draws an interesting connection to linear regression, as it shows
that locally the lasso fit is just a translation of the linear regression fit
on $X_{\mathcal{A}}$. The same results (on degrees of freedom and local representations of
the fit) hold when the active set $\mathcal{A}$ is replaced by the equicorrelation set $\mathcal{E}$.

Our results also extend to the generalized lasso problem, with an arbitrary
predictor matrix $X$ and arbitrary penalty matrix $D$. We showed that degrees
of freedom of the generalized lasso fit is $\mathrm{E}[\dim(X(\mathrm{null}(D_{-\mathcal{A}})))]$, with
$\mathcal{A} = \mathcal{A}(y)$ being the active set of any generalized lasso solution at $y$, that is,
$\mathcal{A}(y) = \mathrm{supp}(D\hat\beta(y))$. As before, this result is well defined because any choice
of active set $\mathcal{A}$ generates the same linear subspace $X(\mathrm{null}(D_{-\mathcal{A}}))$, almost
everywhere in $y$. Furthermore, for almost every $y$, and for any active set $\mathcal{A}$ of
a solution at $y$, the generalized lasso fit satisfies

$X\hat\beta(y') = P_{X(\mathrm{null}(D_{-\mathcal{A}}))}(y') + c$

for all $y'$ in a neighborhood of $y$, where $c$ is a constant (not depending on $y'$).
This again reveals an interesting connection to linear regression, since it says
that locally the generalized lasso fit is a translation of the linear regression
fit on $X$, with the coefficients $\beta$ subject to $D_{-\mathcal{A}} \beta = 0$. The same statements
hold with the active set $\mathcal{A}$ replaced by the boundary set $\mathcal{B}$ of an optimal
subgradient.

We note that our results provide practically useful estimates of degrees of
freedom. For the lasso problem, we can use $\mathrm{rank}(X_{\mathcal{A}})$ as an unbiased
estimate of degrees of freedom, with $\mathcal{A}$ being the active set of a lasso solution.
To emphasize what has already been said, here we can actually choose any
active set (i.e., any solution), because all active sets give rise to the same
$\mathrm{rank}(X_{\mathcal{A}})$, except for $y$ in a set of measure zero. This is important, since
different algorithms for the lasso can produce different solutions with
different active sets. For the generalized lasso problem, an unbiased estimate
for degrees of freedom is given by $\dim(X(\mathrm{null}(D_{-\mathcal{A}}))) = \mathrm{rank}(X P_{\mathrm{null}(D_{-\mathcal{A}})})$,
where $\mathcal{A}$ is the active set of a generalized lasso solution. This estimate is
the same, regardless of the choice of active set (i.e., choice of solution), for
almost every $y$. Hence any algorithm can be used to compute a solution.
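One way to evaluate the generalized lasso estimate without forming the projection $P_{\mathrm{null}(D_{-\mathcal{A}})}$ explicitly is through a rank identity. This identity is a standard linear algebra fact offered here as a computational sketch; it is not stated in the paper, and the stacked-matrix notation is introduced only for this illustration.

```latex
% For a linear subspace L of R^p, the image under X loses exactly the part of L inside null(X):
\dim(X(L)) = \dim(L) - \dim(L \cap \mathrm{null}(X)).
% Take L = null(D_{-A}); the intersection is the null space of the stacked matrix [X; D_{-A}]:
\mathrm{null}(D_{-\mathcal{A}}) \cap \mathrm{null}(X)
  = \mathrm{null}\!\begin{bmatrix} X \\ D_{-\mathcal{A}} \end{bmatrix}.
% Applying rank-nullity to both pieces gives
\dim\big(X(\mathrm{null}(D_{-\mathcal{A}}))\big)
  = \big(p - \mathrm{rank}(D_{-\mathcal{A}})\big) - \Big(p - \mathrm{rank}\!\begin{bmatrix} X \\ D_{-\mathcal{A}} \end{bmatrix}\Big)
  = \mathrm{rank}\!\begin{bmatrix} X \\ D_{-\mathcal{A}} \end{bmatrix} - \mathrm{rank}(D_{-\mathcal{A}}).
% Sanity check with D = I and A = S: rank[X; I_{-S}] = (p - |S|) + rank(X_S), so the difference is rank(X_S).
```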

APPENDIX A: PROOFS AND TECHNICAL ARGUMENTS 

A.l. Proof of Lemma 1. The proof relies on the fact that the projection 
Pcix) of X G M" onto a closed convex set C C satisfies 

(40) {x-Pc{x),Pc{x)-u)>0 for any n G C. 

First, we prove the statement for the projection map. Note that 

\\Pc{x) - Pc{y)\\l 

= {Pc{x) -x + y- Pc{y) +x-y, Pc{x) - Pc{y)) 

= {Pc{x) - X, Pc{x) - Pc{y)) + {y- Pc{y),Pc{x) - Pc{y)) 

+ {x-y,Pc{x)-Pc{y)) 

<{x-y,Pc{x)-Pc{y)) 

<\\x-y\\2\\Pc{x)-Pc{y)\\2, 

where the first inequality follows from (40), and the second is by Cauchy- 
Schwarz. Dividing both sides by [[^(^(x) — Pc{y)\\2 gives the result. 
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Now, for the residual map, the steps are similar. 

    $\|(I - P_C)(x) - (I - P_C)(y)\|_2^2$

    $= \langle P_C(y) - P_C(x) + x - y,\; x - P_C(x) + P_C(y) - y \rangle$

    $= \langle P_C(y) - P_C(x),\; x - P_C(x) \rangle + \langle P_C(y) - P_C(x),\; P_C(y) - y \rangle$

    $\quad + \langle x - y,\; x - P_C(x) + P_C(y) - y \rangle$

    $\le \langle x - y,\; x - P_C(x) + P_C(y) - y \rangle$

    $\le \|x - y\|_2 \, \|(I - P_C)(x) - (I - P_C)(y)\|_2.$

Again the two inequalities are from (40) and Cauchy-Schwarz, respectively,
and dividing both sides by $\|(I - P_C)(x) - (I - P_C)(y)\|_2$ gives the result.

We have shown that $P_C$ and $I - P_C$ are Lipschitz (with constant 1);
they are therefore continuous, and almost differentiability follows from the
standard proof of the fact that a Lipschitz function is differentiable almost
everywhere.
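
The two Lipschitz bounds can be checked numerically on a simple closed convex set. The sketch below assumes, purely for illustration, that $C$ is the box $\{u : \|u\|_\infty \le 1\}$, whose projection is coordinatewise clipping; the helper names and the random test points are not from the paper.

```python
import numpy as np


def project_box(x, radius=1.0):
    """Euclidean projection onto the box {u : ||u||_inf <= radius}."""
    return np.clip(x, -radius, radius)


def nonexpansive_gaps(x, y, radius=1.0):
    """Return ||x-y|| - ||P(x)-P(y)|| and ||x-y|| - ||(I-P)(x)-(I-P)(y)||."""
    px, py = project_box(x, radius), project_box(y, radius)
    dist = np.linalg.norm(x - y)
    gap_projection = dist - np.linalg.norm(px - py)
    gap_residual = dist - np.linalg.norm((x - px) - (y - py))
    return gap_projection, gap_residual


rng = np.random.default_rng(0)
gaps = [
    nonexpansive_gaps(3 * rng.standard_normal(5), 3 * rng.standard_normal(5))
    for _ in range(1000)
]
# Both quantities should be nonnegative, up to floating-point rounding.
worst_projection_gap = min(g[0] for g in gaps)
worst_residual_gap = min(g[1] for g in gaps)
```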

A.2. Proof of Lemma 2. We write $\mathcal{F}$ to denote the set of faces of $C$.
To each face $F \in \mathcal{F}$ there is an associated normal cone $N(F)$, defined as

    $N(F) = \{ x \in \mathbb{R}^n : F \subseteq \mathrm{argmax}_{y \in C}\; x^T y \}.$

The normal cone of $F$ satisfies $N(F) = P_C^{-1}(u) - u$ for any $u \in \mathrm{relint}(F)$.
[We use $\mathrm{relint}(A)$ to denote the relative interior of a set $A$ and
$\mathrm{relbd}(A)$ to denote its relative boundary.]
Define the set

    $S = \bigcup_{F \in \mathcal{F}} \big( \mathrm{relint}(F) + \mathrm{relint}(N(F)) \big).$

Because $C$ is a polyhedron, we have that $\dim(F) + \dim(N(F)) = n$ for each
$F \in \mathcal{F}$, and therefore each $U_F = \mathrm{relint}(F) + \mathrm{relint}(N(F))$
is an open set in $\mathbb{R}^n$.

Now let $x \in S$. We have $x \in U_F$ for some $F \in \mathcal{F}$, and by construction
$P_C(U_F) = \mathrm{relint}(F)$. Furthermore, we claim that projecting $x \in U_F$ onto $C$
is the same as projecting $x$ onto the affine hull of $F$, that is,
$P_C(U_F) = P_{\mathrm{aff}(F)}(U_F)$. Otherwise there is some $y \in U_F$ with
$P_C(y) \ne P_{\mathrm{aff}(F)}(y)$, and as $\mathrm{aff}(F) \supseteq F$, this means that
$\|y - P_{\mathrm{aff}(F)}(y)\|_2 < \|y - P_C(y)\|_2$. By definition of
$\mathrm{relint}(F)$, there is some $\alpha \in (0,1)$ such that
$u = \alpha P_C(y) + (1 - \alpha) P_{\mathrm{aff}(F)}(y) \in F$.
But $\|y - u\|_2 \le \alpha \|y - P_C(y)\|_2 + (1 - \alpha) \|y - P_{\mathrm{aff}(F)}(y)\|_2 < \|y - P_C(y)\|_2$,
which is a contradiction. This proves the claim, and writing $\mathrm{aff}(F) = L + a$,
we have

    $P_C(y) = P_L(y - a) + a$ for $y \in U_F$,

as desired.
It remains to show that $S^c = \mathbb{R}^n \setminus S$ has measure zero. Note that
$S^c$ contains points of the form $u + x$, where either:

(1) $u \in \mathrm{relbd}(F)$, $x \in N(F)$ for some $F$ with $\dim(F) \ge 1$; or

(2) $u \in \mathrm{relint}(F)$, $x \in \mathrm{relbd}(N(F))$ for some $F \ne C$.

In the first type of points above, vertices are excluded because
$\mathrm{relbd}(F) = \emptyset$ when $F$ is a vertex. In the second type, $C$ is excluded
because $\mathrm{relbd}(N(C)) = \emptyset$. The lattice structure of $\mathcal{F}$ tells us
that for any face $F \in \mathcal{F}$, we can write
$\mathrm{relbd}(F) = \bigcup_{G \in \mathcal{F},\, G \subsetneq F} \mathrm{relint}(G)$.
This, and the fact that the normal cones have the opposite partial ordering as the
faces, imply that points of the first type above can be written as $u' + x'$ with
$u' \in \mathrm{relint}(G)$ and $x' \in N(G)$ for some $G \subsetneq F$. Note that actually
we must have $x' \in \mathrm{relbd}(N(G))$ because otherwise we would have $u' + x' \in S$.
Therefore it suffices to consider points of the second type alone, and $S^c$ can be
written as

    $S^c = \bigcup_{F \in \mathcal{F},\, F \ne C} \big( \mathrm{relint}(F) + \mathrm{relbd}(N(F)) \big).$

As $C$ is a polyhedron, the set $\mathcal{F}$ of its faces is finite, and
$\dim(\mathrm{relbd}(N(F))) \le n - \dim(F) - 1$ for each $F \in \mathcal{F}$, $F \ne C$.
Therefore $S^c$ is a finite union of sets of dimension $\le n - 1$, and hence has
measure zero.
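
The local affine form $P_C(y) = P_L(y - a) + a$ from the first part of this proof can be made concrete on a simple polyhedron. The sketch below assumes, for illustration only, that $C$ is the box $\{u : \|u\|_\infty \le 1\}$; in that case the face $F$ is determined by which coordinates of $x$ are clipped, $L$ is spanned by the unclipped coordinates, and the exceptional set $S^c$ consists of points with some $|x_i| = 1$. The helper names and the test point are hypothetical.

```python
import numpy as np


def project_box(x, radius=1.0):
    """Euclidean projection onto the box {u : ||u||_inf <= radius}."""
    return np.clip(x, -radius, radius)


def local_affine_map(x, radius=1.0):
    """Return (P_L, a) such that P_C(y) = P_L (y - a) + a for y near x.

    Assumes no coordinate of x satisfies |x_i| == radius (a measure-zero set).
    """
    clipped = np.abs(x) > radius
    # L is spanned by the coordinates that are not clipped at x.
    P_L = np.diag((~clipped).astype(float))
    # a lies in aff(F): the bound on clipped coordinates, zero elsewhere.
    a = np.where(clipped, np.sign(x) * radius, 0.0)
    return P_L, a


x = np.array([2.0, 0.3, -1.7])
P_L, a = local_affine_map(x)

rng = np.random.default_rng(0)
max_deviation = 0.0
for _ in range(200):
    y_near = x + rng.uniform(-0.2, 0.2, size=x.size)  # stays in the same region
    affine_value = P_L @ (y_near - a) + a
    max_deviation = max(max_deviation, np.max(np.abs(affine_value - project_box(y_near))))
```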

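A.3. Proof of Lemma 6. First some notation. For $S \subseteq \{1, \ldots, k\}$, define
the function $\pi_S : \mathbb{R}^k \to \mathbb{R}^{|S|}$ by $\pi_S(x) = x_S$. So $\pi_S$
just extracts the coordinates in $S$.

Now let

    $\mathcal{N} = \bigcup_{\mathcal{E}, s} \bigcup_{\mathcal{A} \in Z(\mathcal{E})} \big\{ z \in \mathbb{R}^n : P_{[\pi_{-\mathcal{A}}(\mathrm{null}(X_{\mathcal{E}}))]^\perp} [(X_{\mathcal{E}})^+]_{(-\mathcal{A},\cdot)} \big( z - (X_{\mathcal{E}}^T)^+ \lambda s \big) = 0 \big\}.$

The first union is taken over all possible subsets $\mathcal{E} \subseteq \{1, \ldots, p\}$
and all sign vectors $s \in \{-1, 1\}^{|\mathcal{E}|}$; as for the second union, we define
for a fixed subset $\mathcal{E}$

    $Z(\mathcal{E}) = \big\{ \mathcal{A} \subseteq \mathcal{E} : P_{[\pi_{-\mathcal{A}}(\mathrm{null}(X_{\mathcal{E}}))]^\perp} [(X_{\mathcal{E}})^+]_{(-\mathcal{A},\cdot)} \ne 0 \big\}.$

Notice that $\mathcal{N}$ is a finite union of affine subspaces of dimension $\le n - 1$,
and hence has measure zero.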

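Let $y \notin \mathcal{N}$, and let $\hat\beta(y)$ be a lasso solution, abbreviating
$\mathcal{A} = \mathcal{A}(y)$ and $r = r(y)$ for the active set and active signs. Also
write $\mathcal{E} = \mathcal{E}(y)$ and $s = s(y)$ for the equicorrelation set and
equicorrelation signs of the fit. We know from (22) that we can write

    $\hat\beta_{-\mathcal{E}}(y) = 0$ and $\hat\beta_{\mathcal{E}}(y) = (X_{\mathcal{E}})^+ \big( y - (X_{\mathcal{E}}^T)^+ \lambda s \big) + b,$

where $b \in \mathrm{null}(X_{\mathcal{E}})$ is such that

    $\hat\beta_{\mathcal{E} \setminus \mathcal{A}}(y) = [(X_{\mathcal{E}})^+]_{(-\mathcal{A},\cdot)} \big( y - (X_{\mathcal{E}}^T)^+ \lambda s \big) + b_{-\mathcal{A}} = 0.$

In other words,

    $[(X_{\mathcal{E}})^+]_{(-\mathcal{A},\cdot)} \big( y - (X_{\mathcal{E}}^T)^+ \lambda s \big) = -b_{-\mathcal{A}} \in \pi_{-\mathcal{A}}(\mathrm{null}(X_{\mathcal{E}})),$

so projecting onto the orthogonal complement of the linear subspace
$\pi_{-\mathcal{A}}(\mathrm{null}(X_{\mathcal{E}}))$ gives zero,

    $P_{[\pi_{-\mathcal{A}}(\mathrm{null}(X_{\mathcal{E}}))]^\perp} [(X_{\mathcal{E}})^+]_{(-\mathcal{A},\cdot)} \big( y - (X_{\mathcal{E}}^T)^+ \lambda s \big) = 0.$

Since $y \notin \mathcal{N}$, we know that

    $P_{[\pi_{-\mathcal{A}}(\mathrm{null}(X_{\mathcal{E}}))]^\perp} [(X_{\mathcal{E}})^+]_{(-\mathcal{A},\cdot)} = 0,$

and finally, this can be rewritten as

(41)  $\mathrm{col}\big( [(X_{\mathcal{E}})^+]_{(-\mathcal{A},\cdot)} \big) \subseteq \pi_{-\mathcal{A}}(\mathrm{null}(X_{\mathcal{E}})).$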

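Consider defining, for a new point $y'$,

    $\hat\beta_{-\mathcal{E}}(y') = 0$ and $\hat\beta_{\mathcal{E}}(y') = (X_{\mathcal{E}})^+ \big( y' - (X_{\mathcal{E}}^T)^+ \lambda s \big) + b',$

where $b' \in \mathrm{null}(X_{\mathcal{E}})$, and is yet to be determined. Exactly as in the
proof of Lemma 5, we know that $X_{\mathcal{E}}^T (y' - X \hat\beta(y')) = \lambda s$, and
$\|X_{-\mathcal{E}}^T (y' - X \hat\beta(y'))\|_\infty < \lambda$ for all $y' \in U_1$, a
neighborhood of $y$.

Now we want to choose $b'$ so that $\hat\beta(y')$ has the correct active set and active
signs. For simplicity of notation, first define the function
$f : \mathbb{R}^n \to \mathbb{R}^{|\mathcal{E}|}$,

    $f(x) = (X_{\mathcal{E}})^+ \big( x - (X_{\mathcal{E}}^T)^+ \lambda s \big).$

Equation (41) implies that there is a $b' \in \mathrm{null}(X_{\mathcal{E}})$ such that
$b'_{-\mathcal{A}} = -f_{-\mathcal{A}}(y')$, hence $\hat\beta_{\mathcal{E} \setminus \mathcal{A}}(y') = 0$.
However, we must choose $b'$ so that additionally $\hat\beta_i(y') \ne 0$ for
$i \in \mathcal{A}$ and $\mathrm{sign}(\hat\beta_{\mathcal{A}}(y')) = r$. Write

    $\hat\beta_{\mathcal{E}}(y') = (f(y') + b) + (b' - b).$

By the continuity of $f + b$, there exists a neighborhood $U_2$ of $y$ such that
$f_i(y') + b_i \ne 0$ for $i \in \mathcal{A}$ and $\mathrm{sign}(f_{\mathcal{A}}(y') + b_{\mathcal{A}}) = r$,
for all $y' \in U_2$. Therefore we only need to choose a vector
$b' \in \mathrm{null}(X_{\mathcal{E}})$, with $b'_{-\mathcal{A}} = -f_{-\mathcal{A}}(y')$,
such that $\|b' - b\|_2$ is sufficiently small. This can be achieved by applying the
bounded inverse theorem, which says that the bijective linear map $\pi_{-\mathcal{A}}$ has
a bounded inverse (when considered a function from its row space to its column
space). Therefore there exists some $M > 0$ such that for any $y'$, there
is a vector $b' \in \mathrm{null}(X_{\mathcal{E}})$, $b'_{-\mathcal{A}} = -f_{-\mathcal{A}}(y')$, with

    $\|b' - b\|_2 \le M \|f_{-\mathcal{A}}(y') - f_{-\mathcal{A}}(y)\|_2.$

Finally, the continuity of $f_{-\mathcal{A}}$ implies that
$\|f_{-\mathcal{A}}(y') - f_{-\mathcal{A}}(y)\|_2$ can be made sufficiently small by
restricting $y' \in U_3$, another neighborhood of $y$.

Letting $U = U_1 \cap U_2 \cap U_3$, we have shown that for any $y' \in U$, there exists
a lasso solution $\hat\beta(y')$ with active set $\mathcal{A}(y') = \mathcal{A}$ and active
signs $r(y') = r$.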
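
The local stability just proved is easiest to see in a special case that is assumed here only for illustration: when $X = I$, the lasso solution is coordinatewise soft-thresholding of $y$, and the active set and signs can change only when some $|y_i| = \lambda$, which is a finite union of affine subspaces of dimension $n - 1$. The function names and the test vector below are hypothetical.

```python
import numpy as np


def lasso_identity_design(y, lam):
    """Lasso solution when X = I: coordinatewise soft-thresholding of y."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)


def active_set_and_signs(beta):
    """Active set (as 0-based indices) and the signs of the active coefficients."""
    active = np.flatnonzero(beta != 0)
    return (tuple(int(i) for i in active),
            tuple(float(s) for s in np.sign(beta[active])))


y = np.array([2.0, -0.3, 0.9, -1.5])
lam = 1.0
reference = active_set_and_signs(lasso_identity_design(y, lam))

# Every |y_i| is at least 0.1 away from lam, so perturbations of size <= 0.05
# stay inside a neighborhood on which the active set and signs are constant.
rng = np.random.default_rng(1)
stable = all(
    active_set_and_signs(
        lasso_identity_design(y + rng.uniform(-0.05, 0.05, size=y.size), lam)
    ) == reference
    for _ in range(200)
)
```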
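A.4. Proof of Lemma 9. Define the set

    $\mathcal{N} = \bigcup_{\mathcal{B}, s} \bigcup_{\mathcal{A} \in Z(\mathcal{B})} \big\{ z \in \mathbb{R}^n : P_{[D_{\mathcal{B} \setminus \mathcal{A}}(\mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}}))]^\perp} D_{\mathcal{B} \setminus \mathcal{A}} (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big( z - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s \big) = 0 \big\}.$

The first union above is taken over all subsets $\mathcal{B} \subseteq \{1, \ldots, m\}$ and
all sign vectors $s \in \{-1, 1\}^{|\mathcal{B}|}$. The second union is taken over subsets
$\mathcal{A} \in Z(\mathcal{B})$, where

    $Z(\mathcal{B}) = \big\{ \mathcal{A} \subseteq \mathcal{B} : P_{[D_{\mathcal{B} \setminus \mathcal{A}}(\mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}}))]^\perp} D_{\mathcal{B} \setminus \mathcal{A}} (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \ne 0 \big\}.$

Since $\mathcal{N}$ is a finite union of affine subspaces of dimension $\le n - 1$, it has
measure zero.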

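Now fix $y \notin \mathcal{N}$, and let $(\hat\gamma(y), \hat\beta(y))$ be an optimal pair,
with boundary set $\mathcal{B} = \mathcal{B}(y)$, boundary signs $s = s(y)$, active set
$\mathcal{A} = \mathcal{A}(y)$, and active signs $r = r(y)$. Starting from (34), and
plugging in for the fit in terms of $\mathcal{B}, s$, as in (35) we can show that

    $\hat\gamma_{-\mathcal{B}}(y) = \lambda^{-1} (D_{-\mathcal{B}}^T)^+ \big( X^T P_{\mathrm{null}(P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)}\, y$

    $\quad + (X^T (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ - I) D_{\mathcal{B}}^T \lambda s \big) + c,$

where $c \in \mathrm{null}(D_{-\mathcal{B}}^T)$. By (36), we know that

    $\hat\beta(y) = (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big( y - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s \big) + b,$

where $b \in \mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}})$. Furthermore,

    $D_{\mathcal{B} \setminus \mathcal{A}} \hat\beta(y) = D_{\mathcal{B} \setminus \mathcal{A}} (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big( y - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s \big) + D_{\mathcal{B} \setminus \mathcal{A}} b = 0,$

or equivalently,

    $D_{\mathcal{B} \setminus \mathcal{A}} (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big( y - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s \big)$

    $\quad = -D_{\mathcal{B} \setminus \mathcal{A}} b \in D_{\mathcal{B} \setminus \mathcal{A}}(\mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}})).$

Projecting onto the orthogonal complement of the linear subspace
$D_{\mathcal{B} \setminus \mathcal{A}}(\mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}}))$
therefore gives zero,

    $P_{[D_{\mathcal{B} \setminus \mathcal{A}}(\mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}}))]^\perp} D_{\mathcal{B} \setminus \mathcal{A}} (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big( y - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s \big) = 0,$

and because $y \notin \mathcal{N}$, we know that in fact

    $P_{[D_{\mathcal{B} \setminus \mathcal{A}}(\mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}}))]^\perp} D_{\mathcal{B} \setminus \mathcal{A}} (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ = 0.$

This can be rewritten as

(42)  $\mathrm{col}\big( D_{\mathcal{B} \setminus \mathcal{A}} (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big) \subseteq D_{\mathcal{B} \setminus \mathcal{A}}(\mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}})).$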
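At a new point $y'$, consider defining $\hat\gamma_{\mathcal{B}}(y') = s$,

    $\hat\gamma_{-\mathcal{B}}(y') = \lambda^{-1} (D_{-\mathcal{B}}^T)^+ \big( X^T P_{\mathrm{null}(P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)}\, y'$

    $\quad + (X^T (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ - I) D_{\mathcal{B}}^T \lambda s \big) + c,$

and

    $\hat\beta(y') = (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big( y' - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s \big) + b',$

where $b' \in \mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}})$ is yet to be determined.
By construction, $\hat\gamma(y')$ and $\hat\beta(y')$ satisfy the stationarity condition
(31) at $y'$. Hence it remains to show two parts: first, we must show that this pair
satisfies the subgradient condition (32) at $y'$; second, we must show this pair has
boundary set $\mathcal{B}(y') = \mathcal{B}$, boundary signs $s(y') = s$, active set
$\mathcal{A}(y') = \mathcal{A}$ and active signs $r(y') = r$. Actually, it suffices to
show the second part alone, because the first part is then implied by the fact that
$\hat\gamma(y)$ and $\hat\beta(y)$ satisfy the subgradient condition at $y$. Well, by the
continuity of the function $f : \mathbb{R}^n \to \mathbb{R}^{m - |\mathcal{B}|}$,

    $f(x) = \lambda^{-1} (D_{-\mathcal{B}}^T)^+ \big( X^T P_{\mathrm{null}(P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)}\, x$

    $\quad + (X^T (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ - I) D_{\mathcal{B}}^T \lambda s \big) + c,$

we have $\|\hat\gamma_{-\mathcal{B}}(y')\|_\infty < 1$ provided that $y' \in U_1$, a
neighborhood of $y$. This ensures that $\hat\gamma(y')$ has boundary set
$\mathcal{B}(y') = \mathcal{B}$ and signs $s(y') = s$.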

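As for the active set and signs of $\hat\beta(y')$, note first that
$D_{-\mathcal{B}} \hat\beta(y') = 0$, following directly from the definition. Next, define
the function $g : \mathbb{R}^n \to \mathbb{R}^p$,

    $g(x) = (X P_{\mathrm{null}(D_{-\mathcal{B}})})^+ \big( x - (P_{\mathrm{null}(D_{-\mathcal{B}})} X^T)^+ D_{\mathcal{B}}^T \lambda s \big),$

so $\hat\beta(y') = g(y') + b'$. Equation (42) implies that there is a vector
$b' \in \mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}})$ such that
$D_{\mathcal{B} \setminus \mathcal{A}} b' = -D_{\mathcal{B} \setminus \mathcal{A}} g(y')$, which
makes $D_{\mathcal{B} \setminus \mathcal{A}} \hat\beta(y') = 0$. However, we still need to
choose $b'$ such that $D_i \hat\beta(y') \ne 0$ for all $i \in \mathcal{A}$ and
$\mathrm{sign}(D_{\mathcal{A}} \hat\beta(y')) = r$. To this end, write

    $\hat\beta(y') = (g(y') + b) + (b' - b).$

The continuity of $D_{\mathcal{A}} g$ implies that there is a neighborhood $U_2$ of $y$ such
that $D_i g(y') + D_i b \ne 0$ for all $i \in \mathcal{A}$ and
$\mathrm{sign}(D_{\mathcal{A}} g(y') + D_{\mathcal{A}} b) = r$, for $y' \in U_2$. Since

    $|D_i \hat\beta(y')| \ge |D_i g(y') + D_i b| - |D_i (b' - b)|$

    $\quad \ge |D_i g(y') + D_i b| - \|D^T\|_2 \|b' - b\|_2,$

where $\|D^T\|_2$ is the operator norm of the matrix $D^T$, we only need to choose
$b' \in \mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}})$ such that
$D_{\mathcal{B} \setminus \mathcal{A}} b' = -D_{\mathcal{B} \setminus \mathcal{A}} g(y')$, and such
that $\|b' - b\|_2$ is sufficiently small. This is possible by the bounded inverse
theorem applied to the linear map $D_{\mathcal{B} \setminus \mathcal{A}}$: when considered a
function from its row space to its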
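column space, $D_{\mathcal{B} \setminus \mathcal{A}}$ is bijective and hence has a bounded
inverse. Therefore there is some $M > 0$ such that for any $y'$, there is a
$b' \in \mathrm{null}(X) \cap \mathrm{null}(D_{-\mathcal{B}})$ with
$D_{\mathcal{B} \setminus \mathcal{A}} b' = -D_{\mathcal{B} \setminus \mathcal{A}} g(y')$ and

    $\|b' - b\|_2 \le M \|D_{\mathcal{B} \setminus \mathcal{A}} g(y') - D_{\mathcal{B} \setminus \mathcal{A}} g(y)\|_2.$

The continuity of $D_{\mathcal{B} \setminus \mathcal{A}} g$ implies that the right-hand side
above can be made sufficiently small by restricting $y' \in U_3$, a neighborhood of $y$.

With $U = U_1 \cap U_2 \cap U_3$, we have shown that for $y' \in U$ there is an
optimal pair $(\hat\gamma(y'), \hat\beta(y'))$ with boundary set
$\mathcal{B}(y') = \mathcal{B}$, boundary signs $s(y') = s$, active set
$\mathcal{A}(y') = \mathcal{A}$ and active signs $r(y') = r$.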

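A.5. Dual problems. The dual of the lasso problem (1) has appeared in
many papers in the literature; as far as we can tell, it was first considered by
Osborne, Presnell and Turlach (2000). We start by rewriting problem (1) as

    $\hat\beta, \hat z \in \mathrm{argmin}_{\beta \in \mathbb{R}^p,\, z \in \mathbb{R}^n}\; \frac{1}{2} \|y - z\|_2^2 + \lambda \|\beta\|_1$ subject to $z = X\beta$,

then we write the Lagrangian

    $\mathcal{L}(\beta, z, v) = \frac{1}{2} \|y - z\|_2^2 + \lambda \|\beta\|_1 + v^T (z - X\beta),$

and we minimize $\mathcal{L}$ over $\beta, z$ to obtain the dual problem

(43)  $\hat v = \mathrm{argmin}_{v \in \mathbb{R}^n}\; \|y - v\|_2^2$ subject to $\|X^T v\|_\infty \le \lambda$.

Taking the gradient of $\mathcal{L}$ with respect to $\beta, z$, and setting this equal to
zero gives

(44)  $\hat v = y - X\hat\beta,$

(45)  $X^T \hat v = \lambda \gamma,$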

where $\gamma \in \mathbb{R}^p$ is a subgradient of the function $f(x) = \|x\|_1$ evaluated
at $x = \hat\beta$. From (43), we can immediately see that the dual solution $\hat v$ is
the projection of $y$ onto the polyhedron $C$ as in Lemma 3, and then (44) shows
that $X\hat\beta = y - \hat v$ is the residual from projecting $y$ onto $C$. Further,
from (45), we can define the equicorrelation set $\mathcal{E}$ as

    $\mathcal{E} = \{ i \in \{1, \ldots, p\} : |X_i^T \hat v| = \lambda \}.$

Note that together (44), (45) are exactly the same as the KKT conditions
(13), (14), so all of the arguments in Section 3 involving the equicorrelation
set $\mathcal{E}$ can be translated to this dual perspective.
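
The relations (43)-(45) can be checked numerically on a small random problem. The sketch below uses a plain cyclic coordinate descent solver, chosen here only for illustration (it is not the paper's method), and then verifies dual feasibility, the subgradient relation on the active set, and the projection inequality (40) for $\hat v = y - X\hat\beta$. The solver, the random data, and the tolerances are assumptions of this example.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_sweeps=2000):
    """Minimize 0.5 * ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()          # resid = y - X beta throughout
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta


rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))
y = rng.standard_normal(30)
lam = 0.5 * np.max(np.abs(X.T @ y))        # below the value at which beta_hat = 0

beta_hat = lasso_coordinate_descent(X, y, lam)
v_hat = y - X @ beta_hat                    # dual solution, as in (44)
corr = X.T @ v_hat                          # equals lam * gamma by (45)
active = np.flatnonzero(beta_hat != 0)
equicorrelation = np.flatnonzero(np.abs(corr) >= lam - 1e-8)

# Projection inequality (40): <y - v_hat, v_hat - u> >= 0 for feasible u.
min_inner = np.inf
for _ in range(200):
    w = rng.standard_normal(30)
    u = w * rng.uniform() * lam / np.max(np.abs(X.T @ w))   # ||X^T u||_inf <= lam
    min_inner = min(min_inner, (y - v_hat) @ (v_hat - u))
```

The active set computed this way is contained in the equicorrelation set, as the definitions require.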

There is a slightly different way to derive the lasso dual, resulting in a
different (but of course, equivalent) formulation. We first rewrite problem (1)
as

    $\hat\beta, \hat z \in \mathrm{argmin}_{\beta \in \mathbb{R}^p,\, z \in \mathbb{R}^p}\; \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|z\|_1$ subject to $z = \beta$,

and by following similar steps to those above, we arrive at the dual problem

(46)  $\hat v \in \mathrm{argmin}_{v \in \mathbb{R}^p}\; \|P_{\mathrm{col}(X)} y - (X^+)^T v\|_2^2$ subject to $\|v\|_\infty \le \lambda$, $v \in \mathrm{row}(X)$.

Each dual solution $\hat v$ (now no longer unique) satisfies

(47)  $(X^+)^T \hat v = P_{\mathrm{col}(X)} y - X\hat\beta,$

(48)  $\hat v = \lambda \gamma.$

The dual problem (46) and its relationship (47), (48) to the primal problem
offer yet another viewpoint to understand some of the results in Section 3.

For the generalized lasso problem, one might imagine that there are three
different dual problems, corresponding to the three different ways of
introducing an auxiliary variable $z$ into the generalized lasso criterion:

    $\hat\beta, \hat z \in \mathrm{argmin}_{\beta, z}\; \frac{1}{2} \|y - z\|_2^2 + \lambda \|D\beta\|_1$ subject to $z = X\beta$;

    $\hat\beta, \hat z \in \mathrm{argmin}_{\beta, z}\; \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|Dz\|_1$ subject to $z = \beta$;

    $\hat\beta, \hat z \in \mathrm{argmin}_{\beta, z}\; \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|z\|_1$ subject to $z = D\beta$.

However, the first two approaches above lead to Lagrangian functions that
cannot be minimized analytically over $\beta, z$. Only the third approach yields
a dual problem in closed form, as given by Tibshirani and Taylor (2011),

(49)  $\hat v \in \mathrm{argmin}_{v \in \mathbb{R}^m}\; \|P_{\mathrm{col}(X)} y - (X^+)^T D^T v\|_2^2$ subject to $\|v\|_\infty \le \lambda$, $D^T v \in \mathrm{row}(X)$.

The relationship between primal and dual solutions is

(50)  $(X^+)^T D^T \hat v = P_{\mathrm{col}(X)} y - X\hat\beta,$

(51)  $\hat v = \lambda \gamma,$

where $\gamma \in \mathbb{R}^m$ is a subgradient of $f(x) = \|x\|_1$ evaluated at
$x = D\hat\beta$. Directly from (49) we can see that $(X^+)^T D^T \hat v$ is the
projection of the point $y' = P_{\mathrm{col}(X)} y$ onto the polyhedron

    $K = \{ (X^+)^T D^T v : \|v\|_\infty \le \lambda,\; D^T v \in \mathrm{row}(X) \}.$

By (50), the primal fit is $X\hat\beta = (I - P_K)(y')$, which can be rewritten as
$X\hat\beta = (I - P_C)(y')$ where $C$ is the polyhedron from Lemma 8, and finally
$X\hat\beta = (I - P_C)(y)$ because $I - P_C$ is zero on $\mathrm{null}(X^T)$. By (51),
we can define the boundary set $\mathcal{B}$ corresponding to a particular dual
solution $\hat v$ as

    $\mathcal{B} = \{ i \in \{1, \ldots, m\} : |\hat v_i| = \lambda \}.$

(This explains its name, as $\mathcal{B}$ gives the coordinates of $\hat v$ that are on
the boundary of the box $\{ v : \|v\|_\infty \le \lambda \}$.) As (50), (51) are
equivalent to the KKT conditions (31), (32) [following from rewriting (50) using
$D^T \hat v \in \mathrm{row}(X)$], the results in Section 4 on the boundary set
$\mathcal{B}$ can all be derived from this dual setting.
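
To make the boundary set concrete, the sketch below specializes (49)-(51) to a case assumed here for illustration: $X = I$ and $D$ the first-difference matrix (the 1d fused lasso). Then $P_{\mathrm{col}(X)} = I$, $X^+ = I$, the constraint $D^T v \in \mathrm{row}(X)$ is vacuous, and (50) reads $D^T \hat v = y - \hat\beta$. The dual is solved by projected gradient descent, which is a generic choice and not the paper's algorithm; the data vector and iteration count are likewise assumptions.

```python
import numpy as np


def first_difference_matrix(n):
    """(n-1) x n matrix D with (D b)_i = b_{i+1} - b_i."""
    return np.diff(np.eye(n), axis=0)


def solve_dual_identity_design(y, D, lam, n_iter=5000):
    """Projected gradient for min_v 0.5 * ||y - D^T v||_2^2 s.t. ||v||_inf <= lam."""
    step = 1.0 / np.linalg.norm(D @ D.T, 2)   # 1 / (largest eigenvalue of D D^T)
    v = np.zeros(D.shape[0])
    for _ in range(n_iter):
        grad = D @ (D.T @ v - y)
        v = np.clip(v - step * grad, -lam, lam)
    return v


y = np.array([0.0, 0.1, -0.1, 3.0, 3.2, 2.9, -1.0, -1.2])
lam = 0.5
D = first_difference_matrix(y.size)

v_hat = solve_dual_identity_design(y, D, lam)
beta_hat = y - D.T @ v_hat                    # primal solution via (50) with X = I
jumps = D @ beta_hat
# Boundary set: coordinates of v_hat on the boundary of the box (0-based indices).
boundary = np.flatnonzero(np.isclose(np.abs(v_hat), lam))
# Active set: support of D beta_hat (0-based indices).
active = np.flatnonzero(np.abs(jumps) > 1e-8)
```

For this particular input the fitted vector is piecewise constant with three pieces, and both the boundary set and the active set come out as the two jump locations (0-based indices 2 and 5).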